[00:13:38] PROBLEM - SSH on analytics1061.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:48:38] PROBLEM - SSH on furud.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:51:08] (PS1) RLazarus: deployment_server: Add keyholder identity for scap [puppet] - https://gerrit.wikimedia.org/r/790455 (https://phabricator.wikimedia.org/T307351) [00:52:30] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:53:01] (CR) RLazarus: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35166/console" [puppet] - https://gerrit.wikimedia.org/r/790455 (https://phabricator.wikimedia.org/T307351) (owner: RLazarus) [00:54:50] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [00:56:00] (JobUnavailable) firing: (3) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:56:52] SRE, SRE-Access-Requests, Scap, Patch-For-Review: Add new user identity to Keyholder for scap - https://phabricator.wikimedia.org/T307351 (RLazarus) Thanks both @joe and @thcipriani for the ping -- agreed clinic duty is as good a route for this as any. Keys committed to private puppet in c844bec...
[00:57:36] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220510T0100) [01:01:20] (CR) Dzahn: [C: +1] "thank you for doing this. this looks all good to me. Maybe just one comment, given the current open task to debug this for T307907, maybe " [puppet] - https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) (owner: Zabe) [01:22:22] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:24:38] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [01:36:52] PROBLEM - Check systemd state on gitlab2001 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:57:45] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:03:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:03:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:04:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:04:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:05:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_navigationtiming_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:44] (PS1) TrainBranchBot: Branch commit for wmf/1.39.0-wmf.11 [core] (wmf/1.39.0-wmf.11) - https://gerrit.wikimedia.org/r/790456 [02:07:50] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.39.0-wmf.11 [core] (wmf/1.39.0-wmf.11) - https://gerrit.wikimedia.org/r/790456 (owner: TrainBranchBot) [02:23:46] (Merged) jenkins-bot: Branch commit for wmf/1.39.0-wmf.11 [core] (wmf/1.39.0-wmf.11) - 
https://gerrit.wikimedia.org/r/790456 (owner: TrainBranchBot) [02:30:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:30:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:31:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:31:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:51:02] RECOVERY - SSH on furud.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:55:45] SRE, ops-eqiad, Cloud-Services, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531 (Andrew) [03:01:00] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:17:30] RECOVERY - SSH on analytics1061.mgmt is OK: SSH OK - 
OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [04:01:26] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:21:48] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [04:30:42] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [04:50:49] (PS8) Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) [04:56:00] (JobUnavailable) firing: (3) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:12:24] SRE, SRE-Access-Requests, Infrastructure-Foundations, serviceops, Release-Engineering-Team (Radar): Need a service account on deploy servers for automated train pre-sync operations - https://phabricator.wikimedia.org/T303857 (Joe) Hey @thcipriani that would be correct, although I need to do i... [05:14:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1172 to test 10.6 T307546', diff saved to https://phabricator.wikimedia.org/P27775 and previous config saved to /var/cache/conftool/dbconfig/20220510-051429-marostegui.json [05:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:35] T307546: Migrate a wikidata DB to MariaDB 10.6 - https://phabricator.wikimedia.org/T307546 [05:18:17] !log dbmaint s3@codfw T307906 [05:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:18:22] T307906: Drop revision_actor_temp in production - https://phabricator.wikimedia.org/T307906 [05:18:36] !log Rename revision_actor_temp on db2109 T307906 [05:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:57] (PS1) Marostegui: db2109: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/790567 [05:29:42] (CR) Marostegui: [C: +2] db2109: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/790567 (owner: 
Marostegui) [05:36:04] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:43:00] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [05:57:46] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:00:04] kormat, marostegui, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220510T0600). [06:30:29] (CR) Giuseppe Lavagetto: [C: +1] envoy: manage strip_matching_host_port setting and enable on thanos-fe [puppet] - https://gerrit.wikimedia.org/r/769749 (https://phabricator.wikimedia.org/T300119) (owner: Herron) [06:32:25] (CR) Giuseppe Lavagetto: [C: +1] hieradata: set thumbor probe timeout [puppet] - https://gerrit.wikimedia.org/r/790292 (https://phabricator.wikimedia.org/T291946) (owner: Filippo Giunchedi) [06:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:47:43] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM bast3005.wikimedia.org [06:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:45] (JobUnavailable) firing: (3) Reduced availability for job gitlab in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:51:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM bast3005.wikimedia.org [06:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:45] (JobUnavailable) resolved: (3) Reduced availability for job gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:56:12] PROBLEM - SSH on furud.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:56:34] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [06:56:35] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [06:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:53] !log ayounsi@cumin1001 START - Cookbook sre.network.cf [06:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:56:56] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [06:56:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:04] Amir1, awight, Urbanecm, and taavi: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220510T0700). [07:00:04] No Gerrit patches in the queue for this window AFAICS. 
[07:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:04:32] Not a deploy request but could anyone take a review of this patch https://gerrit.wikimedia.org/r/q/Id15581f1df3e9b106b60357c1a697bb4296ff8cb ? [07:04:55] thanks a lot, plans to schedule several days later [07:05:16] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:08:46] SRE, Infrastructure-Foundations: Migrate Ganeti installations in edge sites to a fixed KVM machine type - https://phabricator.wikimedia.org/T307423 (MoritzMuehlenhoff) [07:08:59] SRE, Infrastructure-Foundations: Migrate Ganeti installations in esams to a fixed KVM machine type - https://phabricator.wikimedia.org/T307424 (MoritzMuehlenhoff) [07:09:01] SRE, Infrastructure-Foundations: Migrate Ganeti installations in drmrs to a fixed KVM machine type - https://phabricator.wikimedia.org/T307427 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This is complete [07:09:15] SRE, Infrastructure-Foundations: Migrate Ganeti installations in esams to a fixed KVM machine type - https://phabricator.wikimedia.org/T307424 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This is complete [07:09:17] SRE, Infrastructure-Foundations: Migrate Ganeti installations in edge sites to a fixed KVM machine type - https://phabricator.wikimedia.org/T307423 (MoritzMuehlenhoff) [07:10:18] SRE, Infrastructure-Foundations: Migrate Ganeti installations in edge sites to a fixed KVM machine type - https://phabricator.wikimedia.org/T307423 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff These are all done, unblocking the OS updates. 
I've also amended https://wikitech.wikimedia... [07:11:26] (CR) RhinosF1: [C: -1] "according to the config diff, this enables uploads on some wikis" [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) (owner: Stang) [07:11:59] koi: ^ [07:13:12] https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-diffConfig-docker/11229/console [07:13:27] (PS1) Muehlenhoff: Remove access for statwithlatte [puppet] - https://gerrit.wikimedia.org/r/790613 [07:14:40] thanks, looking [07:15:54] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Naray-ctr out of all services on: 531 hosts [07:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Naray-ctr out of all services on: 531 hosts [07:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:17] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Naray-ctr out of all services on: 1223 hosts [07:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Naray-ctr out of all services on: 1223 hosts [07:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:44] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:17:59] (CR) Muehlenhoff: [C: +2] Remove access for statwithlatte [puppet] - https://gerrit.wikimedia.org/r/790613 (owner: Muehlenhoff) [07:20:05] (PS1) Slyngshede: OpenLDAP, move restart cronjob to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/790614 (https://phabricator.wikimedia.org/T273673) [07:20:37] (CR) jerkins-bot: [V: -1] OpenLDAP, move restart cronjob to systemd timer. 
[puppet] - https://gerrit.wikimedia.org/r/790614 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [07:22:15] (CR) Muehlenhoff: OpenLDAP, move restart cronjob to systemd timer. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/790614 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [07:23:47] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2040.codfw.wmnet [07:23:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:43] (CR) Stang: Remove upload rights on wikis where local uploads are disabled (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) (owner: Stang) [07:34:46] (PS9) Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) [07:35:35] jouncebot: nowandnext [07:35:35] For the next 0 hour(s) and 24 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220510T0700) [07:35:35] In 5 hour(s) and 24 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220510T1300) [07:35:40] (CR) jerkins-bot: [V: -1] Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) (owner: Stang) [07:39:32] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2040.codfw.wmnet [07:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:47] (CR) Ladsgroup: [C: +2] api: Add support for linksmigration in ApiQueryLinks [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/790427 (https://phabricator.wikimedia.org/T304780) (owner: Jforrester) [07:44:02] I'll sneak into this deployment 
window with some minor config changes. [07:44:32] (PS2) Slyngshede: OpenLDAP, move restart cronjob to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/790614 (https://phabricator.wikimedia.org/T273673) [07:44:40] (PS3) Awight: Watch for mapdata cache misses in production [mediawiki-config] - https://gerrit.wikimedia.org/r/786289 (https://phabricator.wikimedia.org/T304813) [07:44:55] (CR) Awight: [C: +2] "Deploying." [mediawiki-config] - https://gerrit.wikimedia.org/r/786289 (https://phabricator.wikimedia.org/T304813) (owner: Awight) [07:45:06] (CR) jerkins-bot: [V: -1] OpenLDAP, move restart cronjob to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/790614 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede) [07:45:20] (PS1) Ladsgroup: Stop writing to revision_actor_temp table everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/790618 (https://phabricator.wikimedia.org/T275246) [07:45:45] (Merged) jenkins-bot: Watch for mapdata cache misses in production [mediawiki-config] - https://gerrit.wikimedia.org/r/786289 (https://phabricator.wikimedia.org/T304813) (owner: Awight) [07:46:29] (CR) Legoktm: Revert "Cache Badtitle 400s for 60s in varnish-fe" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769827 (owner: Legoktm) [07:46:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6001.drmrs.wmnet [07:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:47:52] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 387 probes of 759 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:48:24] !log awight@deploy1002 Synchronized wmf-config: Config: [[gerrit:786289|Watch for mapdata cache misses in production (T304813 T300712)]] (duration: 00m 50s) [07:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:33] T304813: Mapdata API should use the FlaggedRevs stable revision ParserCache when appropriate - https://phabricator.wikimedia.org/T304813 [07:48:33] T300712: Deploy versioned maps support to production - https://phabricator.wikimedia.org/T300712 [07:48:37] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 114 probes of 676 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:48:49] (PS2) Ladsgroup: Stop writing to revision_actor_temp table everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/790618 (https://phabricator.wikimedia.org/T275246) [07:49:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:49:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:49] (PS2) Awight: Clean up unnecessary two-level setting [mediawiki-config] - https://gerrit.wikimedia.org/r/787698 [07:49:54] (CR) Awight: [C: +2] "Deploying." [mediawiki-config] - https://gerrit.wikimedia.org/r/787698 (owner: Awight) [07:49:57] awight: let me know once you're done [07:50:07] Amir1: Sure, I can stop after this one. 
[07:50:16] nah, take your time [07:50:38] (Merged) jenkins-bot: Clean up unnecessary two-level setting [mediawiki-config] - https://gerrit.wikimedia.org/r/787698 (owner: Awight) [07:50:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:50:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:18] Amir1: btw, I *love* the deployment commands script, thank you! [07:51:30] (PS3) Slyngshede: OpenLDAP, move restart cronjob to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/790614 (https://phabricator.wikimedia.org/T273673) [07:51:36] Make sure to report issues using the link below :P [07:51:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:54] https://deploy-commands.toolforge.org/report-issues [07:52:30] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:52:38] oh I did see your easter egg, u can't roll me that easily [07:52:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6001.drmrs.wmnet [07:52:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:47] ugh [07:52:56] need to find new ways [07:54:06] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 11 probes of 759 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:54:52] RECOVERY - 
IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 56 probes of 676 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:55:23] (PS10) Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) [07:55:31] (Merged) jenkins-bot: api: Add support for linksmigration in ApiQueryLinks [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/790427 (https://phabricator.wikimedia.org/T304780) (owner: Jforrester) [07:55:37] Amir1: it was kinda loud in the commit logs [07:55:43] !log awight@deploy1002 Synchronized wmf-config: Config: [[gerrit:787698|Clean up unnecessary two-level setting]] (duration: 00m 49s) [07:55:44] !log failover ganeti master for drmrs/B12 to ganeti6001 [07:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:55] Amir1: deployment is all yours, thanks! [07:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:56:01] Awesome. 
Thanks [07:56:22] (CR) jerkins-bot: [V: -1] Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) (owner: Stang) [07:56:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:02] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [07:57:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:57:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:34] !log installing qemu security updates [07:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:50] confirming the backport works as intended and doesn't break the api either, moving to the rest of infra [08:00:50] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.10/includes/api/ApiQueryLinks.php: Backport: [[gerrit:790427|api: Add support for linksmigration in ApiQueryLinks (T304780)]] (duration: 00m 49s) [08:00:50] PROBLEM - ganeti-wconfd running on ganeti6003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [08:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:55] T304780: Write code for 
enabling compat read for templatelinks and linktarget - https://phabricator.wikimedia.org/T304780 [08:00:58] (PS3) Ladsgroup: Stop writing to revision_actor_temp table everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/790618 (https://phabricator.wikimedia.org/T275246) [08:01:05] (CR) Ladsgroup: [C: +2] Stop writing to revision_actor_temp table everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/790618 (https://phabricator.wikimedia.org/T275246) (owner: Ladsgroup) [08:01:57] (Merged) jenkins-bot: Stop writing to revision_actor_temp table everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/790618 (https://phabricator.wikimedia.org/T275246) (owner: Ladsgroup) [08:02:35] awight: you haven't rebased your changes [08:02:58] (PS11) Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) [08:03:19] good morning [08:03:42] I can deploy it, first IS.php and then CommonSettings.php it seems [08:03:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:04:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:04:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:02] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:790618|Stop writing to revision_actor_temp table everywhere (T275246)]] (duration: 00m 48s) [08:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:07] T275246: Populate rev_actor and rev_comment_id - 
https://phabricator.wikimedia.org/T275246 [08:05:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:11] !log ladsgroup@deploy1002 Synchronized wmf-config: Config: [[gerrit:787698|Clean up unnecessary two-level setting]] (duration: 00m 48s) [08:06:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:26] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:07:55] (PS1) Ladsgroup: Revert "Revert "Set arwiki to read new in templatelinks migration"" [mediawiki-config] - https://gerrit.wikimedia.org/r/790430 [08:08:01] (PS2) Ladsgroup: Revert "Revert "Set arwiki to read new in templatelinks migration"" [mediawiki-config] - https://gerrit.wikimedia.org/r/790430 [08:08:04] (CR) Ladsgroup: [C: +2] Revert "Revert "Set arwiki to read new in templatelinks migration"" [mediawiki-config] - https://gerrit.wikimedia.org/r/790430 (owner: Ladsgroup) [08:08:50] (Merged) jenkins-bot: Revert "Revert "Set arwiki to read new in templatelinks migration"" [mediawiki-config] - https://gerrit.wikimedia.org/r/790430 (owner: Ladsgroup) [08:09:38] (PS12) Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) [08:10:04] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:790430|Revert "Revert "Set arwiki to read new in templatelinks migration""]] (duration: 00m 48s) [08:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:12] hashar: for the train log check, you might see "Unknown column 'lt_namespace' in 'where clause'", that was me testing a fix of the backport [08:10:22] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:25] good ! :) [08:10:29] thx for the notification [08:10:33] (03CR) 10jerkins-bot: [V: 04-1] Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) (owner: 10Stang) [08:10:41] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6003.drmrs.wmnet [08:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:07] (03PS13) 10Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) [08:11:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:11:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:04] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2041.codfw.wmnet [08:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:12:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:18:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:18:16] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:49] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2041.codfw.wmnet [08:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:56] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:19:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:42] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ganeti6003.drmrs.wmnet [08:20:44] (03PS4) 10Slyngshede: OpenLDAP, move restart cronjob to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/790614 (https://phabricator.wikimedia.org/T273673) [08:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:30] (03PS14) 10Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) [08:26:45] (03PS5) 10Slyngshede: OpenLDAP, move restart cronjob to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/790614 (https://phabricator.wikimedia.org/T273673) [08:26:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6002.drmrs.wmnet [08:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:48] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [08:29:25] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add 'timeout' override for service::catalog probes [puppet] - 10https://gerrit.wikimedia.org/r/790291 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [08:29:31] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: set thumbor probe timeout [puppet] - 10https://gerrit.wikimedia.org/r/790292 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [08:29:45] (03PS5) 10Gergő Tisza: GrowthExperiments: Start mailing list campaign on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775951 (https://phabricator.wikimedia.org/T307844) (owner: 10Kosta Harlan) [08:31:16] (03CR) 10Gergő Tisza: "This is now ready to go." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775951 (https://phabricator.wikimedia.org/T307844) (owner: 10Kosta Harlan) [08:31:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6002.drmrs.wmnet [08:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:15] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2042.codfw.wmnet [08:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:21] !log failover ganeti master for drmrs/B13 to ganeti6002 [08:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:48] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [08:33:55] (03PS6) 10Slyngshede: OpenLDAP, move restart cronjob to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/790614 (https://phabricator.wikimedia.org/T273673) [08:37:01] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/790393 (owner: 10BCornwall) [08:38:07] (03CR) 10Jbond: "LGTM" [software/netbox-deploy] (2-11-12) - 10https://gerrit.wikimedia.org/r/790407 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [08:38:29] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2042.codfw.wmnet [08:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:18] (03PS1) 10Filippo Giunchedi: Fix thumbor probe timeout value [puppet] - 10https://gerrit.wikimedia.org/r/790624 (https://phabricator.wikimedia.org/T291946) [08:43:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti6004.drmrs.wmnet [08:43:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:27] (03CR) 10Kosta Harlan: [C: 03+1] GrowthExperiments: Start mailing list campaign on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775951 (https://phabricator.wikimedia.org/T307844) (owner: 10Kosta Harlan) [08:44:34] (03CR) 10Giuseppe Lavagetto: Add a cookbook for rolling reboot of k8s clusters (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [08:44:36] (03CR) 10Filippo Giunchedi: [C: 03+2] Fix thumbor probe timeout value [puppet] - 10https://gerrit.wikimedia.org/r/790624 (https://phabricator.wikimedia.org/T291946) (owner: 10Filippo Giunchedi) [08:46:00] RECOVERY - Disk space on ms-be1040 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1040&var-datasource=eqiad+prometheus/ops [08:46:01] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1040.eqiad.wmnet [08:46:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:27] (03CR) 10Kosta Harlan: [C: 03+1] GrowthExperiments: Start mailing list campaign on eswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775951 (https://phabricator.wikimedia.org/T307844) (owner: 10Kosta Harlan) [08:48:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti6004.drmrs.wmnet [08:48:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:24] !log dbmaint s1@codfw T307906 [08:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:28] T307906: Drop revision_actor_temp in production - https://phabricator.wikimedia.org/T307906 [08:48:28] !log Rename revision_actor_temp on db2092 T307906 [08:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:50] (03PS1) 10Marostegui: db2092: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/790625 [08:50:40] (03CR) 10Marostegui: [C: 03+2] db2092: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/790625 (owner: 10Marostegui) [08:51:08] RECOVERY - Check systemd state on ms-be1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:51:14] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1040.eqiad.wmnet [08:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:33] 10SRE, 10Infrastructure-Foundations, 10netops: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10dcaro) > Is it possible we can allocate these IP addresses on the cloud switches, from the existing 192.168.4.0/24 range? That's ok yes, w... 
[08:53:22] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ores2003.codfw.wmnet with OS buster [08:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:44] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be2053.codfw.wmnet with OS bullseye [08:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:48] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host ms-be2053.codfw.wmnet with OS bullseye [08:59:52] (03PS6) 10Gergő Tisza: GrowthExperiments: Start mailing list campaign on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775951 (https://phabricator.wikimedia.org/T307844) (owner: 10Kosta Harlan) [09:04:09] (03PS1) 10Gergő Tisza: GrowthExperiments: End mailing list campaign on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790626 (https://phabricator.wikimedia.org/T307985) [09:06:52] PROBLEM - Check systemd state on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:07:10] (03CR) 10Kosta Harlan: [C: 04-2] "LGTM, but -2 until we are ready to end the campaign." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790626 (https://phabricator.wikimedia.org/T307985) (owner: 10Gergő Tisza) [09:07:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki2002.codfw.wmnet [09:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:29] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: drmrs: network configuration - https://phabricator.wikimedia.org/T283050 (10ayounsi) 05Open→03Resolved a:03ayounsi This is done. 
[09:10:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2002.codfw.wmnet [09:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:02] (03PS8) 10Filippo Giunchedi: swift: migrate stats_account cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [09:11:57] (03CR) 10Filippo Giunchedi: "Agreed re: Daniel's comment on logging_enabled. I've changed that now and I'm going to merge this change." [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [09:12:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet [09:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet [09:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:27] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: migrate stats_account cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/778485 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [09:14:46] (03CR) 10Gergő Tisza: GrowthExperiments: Start mailing list campaign on eswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775951 (https://phabricator.wikimedia.org/T307844) (owner: 10Kosta Harlan) [09:14:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2002.codfw.wmnet [09:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:47] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) > If there is any kind of anycast with the k8s prefixes (same prefix advertised from multiple 
locations), we should als... [09:18:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Increase traffic on db1172 to test 10.6 T307546', diff saved to https://phabricator.wikimedia.org/P27776 and previous config saved to /var/cache/conftool/dbconfig/20220510-091812-marostegui.json [09:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:17] T307546: Migrate a wikidata DB to MariaDB 10.6 - https://phabricator.wikimedia.org/T307546 [09:18:26] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ores2003.codfw.wmnet with reason: host reimage [09:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:45] (JobUnavailable) firing: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:19:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2002.codfw.wmnet [09:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1002.eqiad.wmnet [09:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:22] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores2003.codfw.wmnet with reason: host reimage [09:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1002.eqiad.wmnet [09:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:45] (JobUnavailable) resolved: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:24:46] 10SRE, 10SRE-swift-storage, 10User-fgiunchedi: swift-account-stats failures on thanos-swift - https://phabricator.wikimedia.org/T307907 (10fgiunchedi) Thank you folks for taking a look! I've merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/778485 (thanks @Zabe !) so at least the cronspam will stop... [09:25:34] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be2053.codfw.wmnet with reason: host reimage [09:25:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:32] (03CR) 10David Caro: openstack: make enc-cli authenticate via keystone (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779899 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [09:28:43] (03CR) 10Giuseppe Lavagetto: Add a cookbook for rolling reboot of k8s clusters (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [09:28:56] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be2053.codfw.wmnet with reason: host reimage [09:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:17] (03PS1) 10Muehlenhoff: Remove LDAP access for jcarvalho [puppet] - 10https://gerrit.wikimedia.org/r/790628 [09:30:11] (03CR) 10Majavah: openstack: make enc-cli authenticate via keystone (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779899 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [09:33:31] (03CR) 10Muehlenhoff: [C: 03+2] Remove LDAP access for jcarvalho [puppet] - 10https://gerrit.wikimedia.org/r/790628 (owner: 10Muehlenhoff) [09:35:58] (03CR) 10Filippo Giunchedi: "I ran this change through "puppet catalog compiler" (with utils/pcc from the puppet repo, although there are other means too, see https://" [puppet] 
- 10https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T123456) (owner: 10Slyngshede) [09:36:06] (03PS2) 10Ayounsi: Update submodule and requirements for 2.11.12 [software/netbox-deploy] (2-11-12) - 10https://gerrit.wikimedia.org/r/790407 (https://phabricator.wikimedia.org/T296452) [09:37:43] jouncebot: next [09:37:43] In 3 hour(s) and 22 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220510T1300) [09:37:50] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host centrallog2002.codfw.wmnet [09:37:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:06] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/netbox-deploy] (2-11-12) - 10https://gerrit.wikimedia.org/r/790407 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [09:40:30] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Update submodule and requirements for 2.11.12 [software/netbox-deploy] (2-11-12) - 10https://gerrit.wikimedia.org/r/790407 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [09:41:01] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:41:18] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:41:58] (03CR) 10David Caro: [C: 03+1] "Let me know when you want this merged!" 
[puppet] - 10https://gerrit.wikimedia.org/r/779899 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [09:42:03] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host centrallog2002.codfw.wmnet [09:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:28] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:42:44] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 127, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:46:10] (03CR) 10Majavah: openstack: make enc-cli authenticate via keystone (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779899 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [09:47:42] (03PS4) 10Giuseppe Lavagetto: New service: image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [09:49:30] (03PS5) 10Vgutierrez: mtail::cache_haproxy: Add HAProxy SLI counters [puppet] - 10https://gerrit.wikimedia.org/r/790298 (https://phabricator.wikimedia.org/T307898) [09:49:32] (03PS5) 10Slyngshede: Rewrite logster::job to use systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) [09:51:17] 10SRE, 10Infrastructure-Foundations, 10netops: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10cmooney) @dcaro thanks! That "ip route show" output would work perfectly yes. Although I was suggesting maybe to add a route for 192.168.... [09:53:06] (03CR) 10jerkins-bot: [V: 04-1] Rewrite logster::job to use systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:53:17] !log klausman@deploy1002 Started deploy [ores/deploy@98a1b2e]: (no justification provided) [09:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:24] !log klausman@deploy1002 Finished deploy [ores/deploy@98a1b2e]: (no justification provided) (duration: 00m 06s) [09:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:06] (03CR) 10David Caro: [C: 03+2] openstack: make enc-cli authenticate via keystone (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779899 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [09:54:19] (03CR) 10jerkins-bot: [V: 04-1] mtail::cache_haproxy: Add HAProxy SLI counters [puppet] - 10https://gerrit.wikimedia.org/r/790298 (https://phabricator.wikimedia.org/T307898) (owner: 10Vgutierrez) [09:55:09] (03CR) 10Majavah: openstack: make enc-cli authenticate via keystone (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779899 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [09:55:56] (03PS6) 10Vgutierrez: mtail::cache_haproxy: Add HAProxy SLI counters [puppet] - 10https://gerrit.wikimedia.org/r/790298 (https://phabricator.wikimedia.org/T307898) [09:56:30] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ores2003.codfw.wmnet with OS buster [09:56:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:41] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: Upgrade Netbox-dev2002 to 2.11 [09:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:46] RECOVERY - SSH on furud.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:59:47] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host 
prometheus1005.eqiad.wmnet [09:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:22] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: Upgrade Netbox-dev2002 to 2.11 (duration: 01m 41s) [10:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:54] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet [10:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:12] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:03:19] (03CR) 10Hnowlan: [C: 03+2] changeprop: Update beta cluster domain names to .cloud [deployment-charts] - 10https://gerrit.wikimedia.org/r/790416 (https://phabricator.wikimedia.org/T307862) (owner: 10Ebernhardson) [10:05:13] (03PS1) 10Muehlenhoff: Add a note on the home for systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/790633 [10:05:49] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet [10:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:12] PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 1.725e+06 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=11 [10:07:28] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [10:07:52] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet [10:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:11] (03CR) 10Muehlenhoff: scap: add new `scap` user to deployment hosts and scap targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [10:09:03] (03Merged) 10jenkins-bot: changeprop: Update beta cluster domain names to .cloud [deployment-charts] - 10https://gerrit.wikimedia.org/r/790416 (https://phabricator.wikimedia.org/T307862) (owner: 10Ebernhardson) [10:10:01] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "diffConfig looks good, should be okay to deploy later today" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787699 (https://phabricator.wikimedia.org/T296759) (owner: 10Awight) [10:10:21] (03CR) 10Lucas Werkmeister (WMDE): Enable versioned maps everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788347 (https://phabricator.wikimedia.org/T300712) (owner: 10Awight) [10:11:22] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "LGTM, should be good for deployment later today (I won’t insist on fixing the commit message)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787690 (https://phabricator.wikimedia.org/T306867) (owner: 10Awight) [10:11:30] thanos rule alerts is me, prometheus reboots in progress [10:11:33] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1006.eqiad.wmnet [10:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:42] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet [10:11:44] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:22] (03CR) 10Muehlenhoff: "Two remaining things, otherwise looks fine to me" [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [10:14:30] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:15:39] (03CR) 10David Caro: [C: 03+1] "LGTM, mostly nits, the permissions one is worth checking before merging though." [puppet] - 10https://gerrit.wikimedia.org/r/785134 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [10:16:36] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:16:54] PROBLEM - Maps - OSM synchronization lag - codfw on alert1001 is CRITICAL: 4.296e+06 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=12 [10:16:57] (03PS1) 10Klausman: hiera: Use celery v5 on ores2003 [puppet] - 10https://gerrit.wikimedia.org/r/790634 (https://phabricator.wikimedia.org/T303801) [10:17:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [10:17:42] (03PS2) 10Klausman: hiera: Use celery v5 on ores2003 [puppet] - 10https://gerrit.wikimedia.org/r/790634 (https://phabricator.wikimedia.org/T303801) [10:17:46] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2006.codfw.wmnet [10:17:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:54] PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 1.726e+06 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=11 [10:19:02] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1006.eqiad.wmnet [10:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:28] (ThanosRuleHighRuleEvaluationFailures) firing: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [10:21:12] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35175/console" [puppet] - 10https://gerrit.wikimedia.org/r/790634 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [10:23:12] 10SRE, 10Infrastructure-Foundations, 10netops: Finalise design extension of WMCS networks to new cloudsw in Eqiad rows E/F - https://phabricator.wikimedia.org/T304989 (10dcaro) > Let me know what you find on the /22 question. It's ok to use /24 for each, no problem there, the /22 I think was just to scope t... 
[10:25:28] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [10:25:43] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [10:25:45] (03PS1) 10Jbond: netbox: 3.1 -> 3.2 add migration script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790636 [10:25:58] (ThanosRuleHighRuleEvaluationFailures) resolved: Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures [10:26:14] (03CR) 10Klausman: [V: 03+1 C: 03+2] hiera: Use celery v5 on ores2003 [puppet] - 10https://gerrit.wikimedia.org/r/790634 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [10:26:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica2005.wikimedia.org [10:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:29] (03CR) 10jerkins-bot: [V: 04-1] netbox: 3.1 -> 3.2 add migration script [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790636 (owner: 10Jbond) [10:27:54] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:27:59] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
ms-be2053.codfw.wmnet with OS bullseye [10:28:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:03] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host ms-be2053.codfw.wmnet with OS bullseye completed: - ms-be2053 (**PASS**) - Downtim... [10:29:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2005.wikimedia.org [10:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica2006.wikimedia.org [10:30:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2006.wikimedia.org [10:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:48] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:36:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1003.wikimedia.org [10:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:11] (03CR) 10Michael Große: [C: 03+1] Configure wgLexemeLexicalCategoryItemIds on Beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790398 (https://phabricator.wikimedia.org/T298150) (owner: 10Lucas Werkmeister (WMDE)) [10:37:47] (03PS6) 10Slyngshede: Rewrite logster::job to use systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) [10:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:38:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1003.wikimedia.org [10:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:53] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10MoritzMuehlenhoff) [10:40:04] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10MoritzMuehlenhoff) p:05Triage→03Medium [10:41:04] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Add kubernetes admin credentials to cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/789808 (owner: 10JMeybohm) [10:42:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1004.wikimedia.org [10:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1004.wikimedia.org [10:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:12] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:46:50] (03CR) 10Lucas Werkmeister (WMDE): Enable versioned maps everywhere (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788347 (https://phabricator.wikimedia.org/T300712) (owner: 10Awight) [10:51:11] (03CR) 10WMDE-Fisch: 
[C: 03+1] Enable versioned maps everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788347 (https://phabricator.wikimedia.org/T300712) (owner: 10Awight) [10:52:09] (03PS1) 10Muehlenhoff: Enable ganeti3 component in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/790643 (https://phabricator.wikimedia.org/T307997) [10:52:35] (03PS7) 10Slyngshede: Rewrite logster::job to use systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) [10:53:06] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:54:11] (03PS1) 10Roman Stolar: Update copyrights. [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/790645 (https://phabricator.wikimedia.org/T307398) [10:54:41] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/790643 (https://phabricator.wikimedia.org/T307997) (owner: 10Muehlenhoff) [10:55:22] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [10:58:44] (03PS5) 10Hnowlan: service: add image-suggestion ingress service [puppet] - 10https://gerrit.wikimedia.org/r/788753 (https://phabricator.wikimedia.org/T304891) [10:58:58] (03PS1) 10Cathal Mooney: List IPs reserved for top-of-rack in cloud ceph hosts profile [puppet] - 10https://gerrit.wikimedia.org/r/790648 (https://phabricator.wikimedia.org/T304989) [11:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:02:46] (03PS8) 10Slyngshede: Rewrite logster::job to use systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) [11:04:16] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35180/console" [puppet] - 10https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:05:32] (03PS1) 10Sergio Gimeno: Account creation: update live campaigns config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 [11:05:40] (03CR) 10Hnowlan: [C: 03+2] changeprop: add sampling configuration, set num_workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/767080 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [11:05:44] !log temporarily install bpfcc-tools on kubernetes1013 (T306181) [11:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:50] T306181: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 [11:08:29] (03CR) 10Sergio Gimeno: "Patch to remove configs for GLAM campaigns since they are over. 
Adds skipWelcomeSurvey => true for existing social campaign to keep the ex" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790650 (owner: 10Sergio Gimeno) [11:09:42] PROBLEM - SSH on wtp1048.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:09:46] (03PS11) 10Jaime Nuche: scap: add new `scap` user to deployment hosts and scap targets [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) [11:09:55] (03Merged) 10jenkins-bot: changeprop: add sampling configuration, set num_workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/767080 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan) [11:10:14] (03CR) 10Jaime Nuche: [C: 03+1] Add a note on the home for systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/790633 (owner: 10Muehlenhoff) [11:12:49] (03CR) 10WhitePhosphorus: "Generally LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) (owner: 10Stang) [11:12:51] (03CR) 10Jaime Nuche: [C: 04-1] scap: add new `scap` user to deployment hosts and scap targets (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [11:16:06] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:16:40] !log purged bpfcc-tools from kubernetes1013 [11:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:21] 10SRE, 10LDAP-Access-Requests, 10User-zeljkofilipin: +2 for esther-akinloose in Gerrit (mediawiki/extensions/VisualEditor) - https://phabricator.wikimedia.org/T305373 (10zeljkofilipin) >>! 
In T305373#7915339, @RLazarus wrote: > Just picking up SRE clinic duty for the week -- I'm so sorry this has been sittin... [11:22:08] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:22:58] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:23:49] (03PS12) 10Jaime Nuche: scap: add new `scap` user to deployment hosts and scap targets [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) [11:24:56] 10SRE, 10LDAP, 10User-jbond: Migrate web services using LDAP authentication towards the readonly LDAP replicas - https://phabricator.wikimedia.org/T227650 (10MoritzMuehlenhoff) I ran tcpdump for the LDAP ports on seaborgium and serpens for a little over an hour: On serpens: - alert1001 - alert2001 -... [11:25:16] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:25:47] (03CR) 10Jaime Nuche: [C: 04-1] scap: add new `scap` user to deployment hosts and scap targets [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [11:27:18] (03CR) 10Muehlenhoff: scap: add new `scap` user to deployment hosts and scap targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789146 (https://phabricator.wikimedia.org/T306991) (owner: 10Jaime Nuche) [11:28:12] (03PS3) 10Majavah: P:openstack::encapi: add keystone token verification [puppet] - 10https://gerrit.wikimedia.org/r/785134 (https://phabricator.wikimedia.org/T274666) [11:30:04] PROBLEM - Disk space on ms-be1040 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdl1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1040&var-datasource=eqiad+prometheus/ops [11:32:40] PROBLEM - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:55] (03PS4) 10Majavah: P:openstack::encapi: add keystone token verification [puppet] - 10https://gerrit.wikimedia.org/r/785134 (https://phabricator.wikimedia.org/T274666) [11:37:16] (03CR) 10Jgiannelos: [C: 03+1] tegola: reduce number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/790405 (https://phabricator.wikimedia.org/T307757) (owner: 10Hnowlan) [11:37:23] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35181/console" [puppet] - 10https://gerrit.wikimedia.org/r/785134 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [11:39:25] (03PS11) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters 
[cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) [11:40:14] (03PS12) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) [11:41:28] (03CR) 10Majavah: [V: 03+1] P:openstack::encapi: add keystone token verification (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/785134 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [11:42:08] (03CR) 10JMeybohm: Add a cookbook for rolling reboot of k8s clusters (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [11:42:35] (03PS1) 10Jbond: P:etcd::tlsproxy: add documentation and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/790656 (https://phabricator.wikimedia.org/T307383) [11:42:39] (03PS1) 10Jbond: P:etcd::tlsproxy: move to cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) [11:43:10] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/790658 [11:43:38] (03CR) 10jerkins-bot: [V: 04-1] P:etcd::tlsproxy: move to cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: 10Jbond) [11:43:41] (03CR) 10Elukey: [C: 03+2] kubernetes: allow deploy-ml-service users to check pods on ml-serve [puppet] - 10https://gerrit.wikimedia.org/r/790288 (owner: 10Elukey) [11:45:48] (03PS2) 10Jbond: P:etcd::tlsproxy: move to cfssl pki [puppet] - 10https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) [11:45:58] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1002 is CRITICAL: instance=10.64.48.64 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:46:47] (03CR) 10Jbond: [V: 03+1] "PCC 
SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35183/console" [puppet] - 10https://gerrit.wikimedia.org/r/790657 (https://phabricator.wikimedia.org/T307383) (owner: 10Jbond) [11:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:48:29] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [11:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:52] (03CR) 10Jbond: [C: 03+2] P:etcd::tlsproxy: add documentation and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/790656 (https://phabricator.wikimedia.org/T307383) (owner: 10Jbond) [11:49:58] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/790658 (owner: 10PipelineBot) [11:50:20] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [11:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:48] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ores2004.codfw.wmnet with OS buster [11:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:56] (03PS1) 10Klausman: hiera: Use celery v5 on ores2004 [puppet] - 10https://gerrit.wikimedia.org/r/790661 (https://phabricator.wikimedia.org/T303801) [11:52:05] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[11:52:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:17] (03CR) 10Jbond: [C: 03+2] rake_modules: rafactor git helper and add new_files [puppet] - 10https://gerrit.wikimedia.org/r/790294 (owner: 10Jbond) [11:52:39] (03PS14) 10Jbond: rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) [11:52:45] (03PS2) 10Muehlenhoff: Enable ganeti3 component in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/790643 (https://phabricator.wikimedia.org/T307997) [11:53:54] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [11:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:13] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/790658 (owner: 10PipelineBot) [11:55:10] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [11:55:18] (03PS3) 10JMeybohm: Add kubernetes admin credentials to cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/789808 [11:55:20] (03PS1) 10JMeybohm: Align cumin aliases for wikikube clusters [puppet] - 10https://gerrit.wikimedia.org/r/790662 (https://phabricator.wikimedia.org/T260661) [11:55:37] (03PS8) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 [11:55:39] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . 
[11:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:56:23] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [11:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:04] (03CR) 10Klausman: [C: 03+2] hiera: Use celery v5 on ores2004 [puppet] - 10https://gerrit.wikimedia.org/r/790661 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [11:57:58] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/790643 (https://phabricator.wikimedia.org/T307997) (owner: 10Muehlenhoff) [12:09:17] (03PS3) 10Muehlenhoff: Enable ganeti3 component in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/790643 (https://phabricator.wikimedia.org/T307997) [12:10:43] (03PS1) 10Jbond: apereo_cas: convert module to use SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/790664 [12:10:56] RECOVERY - SSH on wtp1048.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:11:08] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/790643 (https://phabricator.wikimedia.org/T307997) (owner: 10Muehlenhoff) [12:11:26] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: convert module to use SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/790664 (owner: 10Jbond) [12:15:34] (03CR) 10Muehlenhoff: [C: 03+2] Enable ganeti3 component in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/790643 (https://phabricator.wikimedia.org/T307997) (owner: 10Muehlenhoff) [12:16:48] !log klausman@cumin2002 
START - Cookbook sre.hosts.downtime for 2:00:00 on ores2004.codfw.wmnet with reason: host reimage [12:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:38] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10MoritzMuehlenhoff) [12:19:40] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores2004.codfw.wmnet with reason: host reimage [12:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:00] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:22:27] (03CR) 10MMandere: [C: 03+1] mtail::cache_haproxy: Add HAProxy SLI counters [puppet] - 10https://gerrit.wikimedia.org/r/790298 (https://phabricator.wikimedia.org/T307898) (owner: 10Vgutierrez) [12:26:13] !log jelto@cumin1001 conftool action : set/pooled=yes; selector: name=mw2412.codfw.wmnet [12:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:45] (03PS1) 10Slyngshede: Replace crontab with systemd timer for WikiTech dumps. 
[puppet] - 10https://gerrit.wikimedia.org/r/790670 (https://phabricator.wikimedia.org/T273673) [12:36:14] (03PS1) 10Filippo Giunchedi: prometheus: remove http availability pages, moved to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/790671 (https://phabricator.wikimedia.org/T305847) [12:36:16] (03PS1) 10Vgutierrez: Add HAProxy SLO dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/790672 (https://phabricator.wikimedia.org/T307898) [12:37:38] jouncebot: next [12:37:38] In 0 hour(s) and 22 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220510T1300) [12:38:01] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host graphite2003.codfw.wmnet [12:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:05] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35184/console" [puppet] - 10https://gerrit.wikimedia.org/r/790670 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:43:46] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite2003.codfw.wmnet [12:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:33] (03CR) 10Slyngshede: Replace crontab with systemd timer for WikiTech dumps. 
[puppet] - 10https://gerrit.wikimedia.org/r/790670 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:47:43] (03PS3) 10Awight: Enable CodeMirror colorblind-friendly palette [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787690 (https://phabricator.wikimedia.org/T306867) [12:48:03] (03CR) 10JMeybohm: [C: 04-1] Add datahub-gms to the service catalog (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/780651 (https://phabricator.wikimedia.org/T305358) (owner: 10Btullis) [12:48:09] (03CR) 10Awight: Enable CodeMirror colorblind-friendly palette (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787690 (https://phabricator.wikimedia.org/T306867) (owner: 10Awight) [12:49:29] Lucas_WMDE: I was expecting to self-deploy the config patches. And thank you for the helpful review! [12:49:35] ok! [12:52:14] (03CR) 10Jbond: [C: 03+2] P:wmcs::nfs::maintain_dbusers: replace query_nodes with puppetdb_query [puppet] - 10https://gerrit.wikimedia.org/r/787485 (owner: 10Jbond) [12:52:22] (03PS1) 10Ayounsi: Update submodule and requirements for 3.1.11 [software/netbox-deploy] (3-1) - 10https://gerrit.wikimedia.org/r/790675 (https://phabricator.wikimedia.org/T296452) [12:53:22] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ores2004.codfw.wmnet with OS buster [12:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:54] 10SRE, 10Infrastructure-Foundations: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10MoritzMuehlenhoff) p:05Triage→03Medium [12:56:55] (03CR) 10David Caro: [C: 03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/790648 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [12:57:57] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/netbox-deploy] (3-1) - 10https://gerrit.wikimedia.org/r/790675 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [12:59:38] I will be a little late for the deploy window. Can self-deploy. [13:00:04] RoanKattouw, Lucas_WMDE, and Urbanecm: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220510T1300). [13:00:04] WMDE-Fisch, tgr, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:17] awight: go ahead [13:00:18] * urbanecm waves [13:01:43] (03PS1) 10Klausman: hiera: Use celery v5 on ores2005 [puppet] - 10https://gerrit.wikimedia.org/r/790678 (https://phabricator.wikimedia.org/T303801) [13:02:29] (03CR) 10Awight: [C: 03+2] "Deploying." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/787690 (https://phabricator.wikimedia.org/T306867) (owner: 10Awight) [13:02:52] !log klausman@cumin2002 START - Cookbook sre.hosts.reimage for host ores2005.codfw.wmnet with OS buster [13:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:37] (03PS2) 10Klausman: hiera: Use celery v5 on ores2005 [puppet] - 10https://gerrit.wikimedia.org/r/790678 (https://phabricator.wikimedia.org/T303801) [13:04:15] \o/ [13:05:33] (03PS4) 10Awight: Enable CodeMirror colorblind-friendly palette [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787690 (https://phabricator.wikimedia.org/T306867) [13:05:41] (03CR) 10Awight: [C: 03+2] "Deploying" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787690 (https://phabricator.wikimedia.org/T306867) (owner: 10Awight) [13:05:50] For a moment there I thought WMDE-Fisch was celebrating my trivial patch :D [13:06:02] Maybe I was ;-) [13:06:09] I can choose to believe that [13:06:27] (03Merged) 10jenkins-bot: Enable CodeMirror colorblind-friendly palette [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787690 (https://phabricator.wikimedia.org/T306867) (owner: 10Awight) [13:08:10] (03CR) 10Jbond: [C: 03+1] "LGTM thanks 😊" [puppet] - 10https://gerrit.wikimedia.org/r/790633 (owner: 10Muehlenhoff) [13:09:44] (03CR) 10ArielGlenn: "Note that someone from the WMCS side of things ought to review this, it's out of scope for regular dumps. 
Not sure what state the wikitech" [puppet] - 10https://gerrit.wikimedia.org/r/790670 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [13:10:58] (03PS1) 10Isabelle Hurbain-Palatin: Improve performance of Tegola tile pregeneration [puppet] - 10https://gerrit.wikimedia.org/r/790679 (https://phabricator.wikimedia.org/T307182) [13:11:31] (03CR) 10jerkins-bot: [V: 04-1] Improve performance of Tegola tile pregeneration [puppet] - 10https://gerrit.wikimedia.org/r/790679 (https://phabricator.wikimedia.org/T307182) (owner: 10Isabelle Hurbain-Palatin) [13:11:58] (03CR) 10Muehlenhoff: "I can't meaningfully comment on the Ruby code itself, but I'm wondering if we shouldn't simply apply the check to all files by default (an" [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [13:12:39] (03PS2) 10Awight: Enable new template dialog sidebar everywhere except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787699 (https://phabricator.wikimedia.org/T296759) [13:12:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:12:48] (03CR) 10Awight: [C: 03+2] "Deploying." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/787699 (https://phabricator.wikimedia.org/T296759) (owner: 10Awight) [13:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:55] !log awight@deploy1002 Synchronized wmf-config: Config: [[gerrit:787690|Enable CodeMirror colorblind-friendly palette (T306867)]] (duration: 00m 51s) [13:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:00] T306867: Deploy colorblind friendly color scheme for CodeMirror to all wikis using CodeMirror - https://phabricator.wikimedia.org/T306867 [13:13:10] (03PS2) 10Isabelle Hurbain-Palatin: Improve performance of Tegola tile pregeneration [puppet] - 10https://gerrit.wikimedia.org/r/790679 (https://phabricator.wikimedia.org/T307182) [13:13:36] (03Merged) 10jenkins-bot: Enable new template dialog sidebar everywhere except enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/787699 (https://phabricator.wikimedia.org/T296759) (owner: 10Awight) [13:13:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:13:43] (03PS1) 10Ayounsi: Netbox: update config file for 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/790681 (https://phabricator.wikimedia.org/T296452) [13:13:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:51] (03CR) 10jerkins-bot: [V: 04-1] Improve performance of Tegola tile pregeneration [puppet] - 10https://gerrit.wikimedia.org/r/790679 (https://phabricator.wikimedia.org/T307182) (owner: 10Isabelle Hurbain-Palatin) [13:14:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:14:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:49] (03PS3) 10Isabelle 
Hurbain-Palatin: Improve performance of Tegola tile pregeneration [puppet] - 10https://gerrit.wikimedia.org/r/790679 (https://phabricator.wikimedia.org/T307182) [13:15:35] (03CR) 10jerkins-bot: [V: 04-1] Improve performance of Tegola tile pregeneration [puppet] - 10https://gerrit.wikimedia.org/r/790679 (https://phabricator.wikimedia.org/T307182) (owner: 10Isabelle Hurbain-Palatin) [13:18:03] !log awight@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:787699|Enable new template dialog sidebar everywhere except enwiki (T296759)]] (duration: 00m 49s) [13:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:08] T296759: Deploy VE template dialog improvements to more wikis - https://phabricator.wikimedia.org/T296759 [13:18:26] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/790681 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [13:19:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:00] (03PS2) 10Ayounsi: Netbox: update config file for 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/790681 (https://phabricator.wikimedia.org/T296452) [13:20:07] (03PS3) 10Awight: Enable versioned maps everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788347 (https://phabricator.wikimedia.org/T300712) [13:20:24] (03CR) 10Awight: [C: 03+2] "Deploying." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/788347 (https://phabricator.wikimedia.org/T300712) (owner: 10Awight) [13:20:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:20:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:11] (03Merged) 10jenkins-bot: Enable versioned maps everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788347 (https://phabricator.wikimedia.org/T300712) (owner: 10Awight) [13:21:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:56] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Update submodule and requirements for 3.1.11 [software/netbox-deploy] (3-1) - 10https://gerrit.wikimedia.org/r/790675 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [13:22:35] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) Well I'm not getting anywhere very fast with this. I now understand from @akosiaris... 
[13:24:59] (03CR) 10Muehlenhoff: [C: 03+2] Add a note on the home for systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/790633 (owner: 10Muehlenhoff) [13:25:59] (03CR) 10Vgutierrez: [C: 03+1] P:cache::varnish::frontend: mask the varnishncsa service [puppet] - 10https://gerrit.wikimedia.org/r/789262 (owner: 10Ssingh) [13:26:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:06] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: Upgrade Netbox-dev2002 to 3.1 [13:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:27:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:58] (03CR) 10Muehlenhoff: apereo_cas: convert module to use SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790664 (owner: 10Jbond) [13:28:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:28:38] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: Upgrade Netbox-dev2002 to 3.1 (duration: 01m 32s) [13:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:01] (03CR) 10Klausman: [C: 03+2] hiera: Use celery v5 on ores2005 [puppet] - 10https://gerrit.wikimedia.org/r/790678 (https://phabricator.wikimedia.org/T303801) (owner: 10Klausman) [13:29:29] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors 
- https://phabricator.wikimedia.org/T306181 (10BTullis) I identified the node process that was running eventgate-analytics-external, then ra... [13:29:41] (03PS15) 10Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) [13:30:18] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ores2005.codfw.wmnet with reason: host reimage [13:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:18] (03CR) 10Stang: Remove upload rights on wikis where local uploads are disabled (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) (owner: 10Stang) [13:31:22] !log awight@deploy1002 Synchronized wmf-config: Config: [[gerrit:788347|Enable versioned maps everywhere (T300712)]] (duration: 00m 50s) [13:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:26] T300712: Deploy versioned maps support to production - https://phabricator.wikimedia.org/T300712 [13:32:39] (03PS16) 10Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) [13:33:33] tgr: Would you like to self-deploy, or can I help? [13:33:51] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ores2005.codfw.wmnet with reason: host reimage [13:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:47] Lucas_WMDE: Same question--can I deploy your patch, or would you prefer to do it yourself? [13:35:54] I can deploy it [13:35:57] or you can, if you want [13:36:03] it’s a production no-op anyways ^^ [13:36:22] :-) Sure, it's the least I can do [13:36:31] ok, thanks! 
[13:36:38] (03PS1) 10Ayounsi: Rename invalidate -> clearcache function [software/netbox-deploy] (3-1) - 10https://gerrit.wikimedia.org/r/790685 (https://phabricator.wikimedia.org/T296452) [13:36:42] (03PS2) 10Awight: Configure wgLexemeLexicalCategoryItemIds on Beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790398 (https://phabricator.wikimedia.org/T298150) (owner: 10Lucas Werkmeister (WMDE)) [13:36:49] (03CR) 10Awight: [C: 03+2] "Deploying." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790398 (https://phabricator.wikimedia.org/T298150) (owner: 10Lucas Werkmeister (WMDE)) [13:37:38] (03Merged) 10jenkins-bot: Configure wgLexemeLexicalCategoryItemIds on Beta Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790398 (https://phabricator.wikimedia.org/T298150) (owner: 10Lucas Werkmeister (WMDE)) [13:38:06] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Rename invalidate -> clearcache function [software/netbox-deploy] (3-1) - 10https://gerrit.wikimedia.org/r/790685 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [13:38:36] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: Upgrade Netbox-dev2002 to 3.1 [13:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:47] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: Upgrade Netbox-dev2002 to 3.1 (duration: 02m 11s) [13:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:22] (03PS4) 10Isabelle Hurbain-Palatin: Improve performance of Tegola tile pregeneration [puppet] - 10https://gerrit.wikimedia.org/r/790679 (https://phabricator.wikimedia.org/T307182) [13:41:29] Lucas_WMDE: Should be live now (beta) [13:41:41] cool, thanks! [13:41:45] tgr: Feel free to ping me or deploy yourself. [13:42:02] awight: indeed, it’s working :) thank you! 
[13:42:09] /o\ [13:42:15] I mean \o/ [13:42:36] thx [13:42:42] (03CR) 10Isabelle Hurbain-Palatin: Improve performance of Tegola tile pregeneration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790679 (https://phabricator.wikimedia.org/T307182) (owner: 10Isabelle Hurbain-Palatin) [13:43:17] tgr: Shall I deploy? [13:43:23] (03PS2) 10Lucas Werkmeister (WMDE): Configure wgLexemeLexicalCategoryItemIds on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790399 (https://phabricator.wikimedia.org/T307441) [13:43:34] ah nvm, I see you've logged in :-) [13:43:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:53] yeah, thanks, just started [13:44:32] (03PS1) 10Ayounsi: Remove clearcache [software/netbox-deploy] (3-1) - 10https://gerrit.wikimedia.org/r/790687 (https://phabricator.wikimedia.org/T296452) [13:44:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:44:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:12] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: set an X-Requestctl header for matching rules [software/conftool] - 10https://gerrit.wikimedia.org/r/787437 (https://phabricator.wikimedia.org/T305582) (owner: 10Giuseppe Lavagetto) [13:45:27] (03CR) 10Gergő Tisza: [C: 03+2] GrowthExperiments: Start mailing list campaign on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775951 (https://phabricator.wikimedia.org/T307844) (owner: 10Kosta Harlan) [13:45:30] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Remove clearcache [software/netbox-deploy] (3-1) - 10https://gerrit.wikimedia.org/r/790687 (https://phabricator.wikimedia.org/T296452) (owner: 
10Ayounsi) [13:45:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:00] (03PS7) 10Gergő Tisza: GrowthExperiments: Start mailing list campaign on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775951 (https://phabricator.wikimedia.org/T307844) (owner: 10Kosta Harlan) [13:46:04] (03CR) 10Gergő Tisza: GrowthExperiments: Start mailing list campaign on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775951 (https://phabricator.wikimedia.org/T307844) (owner: 10Kosta Harlan) [13:46:41] (03CR) 10Hnowlan: [C: 03+2] tegola: reduce number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/790405 (https://phabricator.wikimedia.org/T307757) (owner: 10Hnowlan) [13:46:51] (03Merged) 10jenkins-bot: GrowthExperiments: Start mailing list campaign on eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/775951 (https://phabricator.wikimedia.org/T307844) (owner: 10Kosta Harlan) [13:47:12] (03Merged) 10jenkins-bot: requestctl: set an X-Requestctl header for matching rules [software/conftool] - 10https://gerrit.wikimedia.org/r/787437 (https://phabricator.wikimedia.org/T305582) (owner: 10Giuseppe Lavagetto) [13:47:31] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: Upgrade Netbox-dev2002 to 3.1 [13:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:53] (03PS1) 10Ssingh: test_dns: add a test to display resolver information (such as the NSID) [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/790688 [13:48:10] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:cache::varnish::frontend: mask the varnishncsa service [puppet] - 10https://gerrit.wikimedia.org/r/789262 (owner: 10Ssingh) [13:48:48] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: Upgrade Netbox-dev2002 to 3.1 (duration: 01m 17s) [13:48:51] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:02] !log tgr@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:775951|GrowthExperiments: Start mailing list campaign on eswiki (T307844)]] (duration: 00m 51s) [13:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:06] T307844: Turn on GrowthExperiments welcome email campaign - https://phabricator.wikimedia.org/T307844 [13:50:28] (03CR) 10Ssingh: [C: 03+2] test_dns: add a test to display resolver information (such as the NSID) [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/790688 (owner: 10Ssingh) [13:50:35] !log EU mid-day deploys done [13:50:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:50:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:34] (03Merged) 10jenkins-bot: tegola: reduce number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/790405 (https://phabricator.wikimedia.org/T307757) (owner: 10Hnowlan) [13:51:56] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:51:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:50] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10BTullis) I've done more analysis of packet captures from 
eventgate-analytics-external and I s... [13:59:40] (03PS1) 10Ssingh: test_dns: display the NSID before running the other tests [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/790690 [13:59:47] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:00:34] (03CR) 10Ssingh: [C: 03+2] test_dns: display the NSID before running the other tests [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/790690 (owner: 10Ssingh) [14:00:41] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: Allow detecting matching rules that are disabled [software/conftool] - 10https://gerrit.wikimedia.org/r/787438 (https://phabricator.wikimedia.org/T305582) (owner: 10Giuseppe Lavagetto) [14:02:30] (03Merged) 10jenkins-bot: requestctl: Allow detecting matching rules that are disabled [software/conftool] - 10https://gerrit.wikimedia.org/r/787438 (https://phabricator.wikimedia.org/T305582) (owner: 10Giuseppe Lavagetto) [14:03:19] (03CR) 10Giuseppe Lavagetto: [C: 03+2] reqestctl: add unit tests for grammar parsing [software/conftool] - 10https://gerrit.wikimedia.org/r/789153 (https://phabricator.wikimedia.org/T305607) (owner: 10Giuseppe Lavagetto) [14:04:01] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ores2005.codfw.wmnet with OS buster [14:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:08] (03Merged) 10jenkins-bot: reqestctl: add unit tests for grammar parsing [software/conftool] - 10https://gerrit.wikimedia.org/r/789153 (https://phabricator.wikimedia.org/T305607) (owner: 10Giuseppe Lavagetto) [14:05:24] (03CR) 10Giuseppe Lavagetto: requestctl: add AND NOT and OR NOT to the parsing grammar (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/789154 (https://phabricator.wikimedia.org/T305607) (owner: 10Giuseppe Lavagetto) [14:07:05] PROBLEM - Check systemd state 
on ms-be1064 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:09:13] (03PS3) 10Ayounsi: Netbox: update config file for 3.1 [puppet] - 10https://gerrit.wikimedia.org/r/790681 (https://phabricator.wikimedia.org/T296452) [14:09:45] !log klausman@deploy1002 Started deploy [ores/deploy@98a1b2e]: (no justification provided) [14:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:50] !log klausman@deploy1002 Finished deploy [ores/deploy@98a1b2e]: (no justification provided) (duration: 00m 05s) [14:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:54] !log klausman@deploy1002 Started deploy [ores/deploy@98a1b2e]: (no justification provided) [14:09:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:01] !log klausman@deploy1002 Finished deploy [ores/deploy@98a1b2e]: (no justification provided) (duration: 00m 07s) [14:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:06] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync [14:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:18] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync [14:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:19] (03PS1) 10Ayounsi: Update submodule and requirements for 3.2.2 [software/netbox-deploy] (3-2-2) - 10https://gerrit.wikimedia.org/r/790693 (https://phabricator.wikimedia.org/T296452) [14:13:45] (03CR) 10JMeybohm: [C: 04-1] New service: image-suggestion (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [14:18:22] (03PS1) 10MVernon: swift: drain ms-be1059, skip cluster-OK checks [puppet] - 
10https://gerrit.wikimedia.org/r/790694 (https://phabricator.wikimedia.org/T307667) [14:18:30] (03PS17) 10Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) [14:19:24] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/netbox-deploy] (3-2-2) - 10https://gerrit.wikimedia.org/r/790693 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [14:20:40] (03PS18) 10Stang: Remove upload rights on wikis where local uploads are disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790403 (https://phabricator.wikimedia.org/T143789) [14:20:42] (03CR) 10MVernon: "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/790694 (https://phabricator.wikimedia.org/T307667) (owner: 10MVernon) [14:27:48] (03PS3) 10Zabe: swift: remove absented stats_account cron [puppet] - 10https://gerrit.wikimedia.org/r/778486 (https://phabricator.wikimedia.org/T273673) [14:31:36] (03PS1) 10Jelto: gitlab: allow multiple passive hosts [puppet] - 10https://gerrit.wikimedia.org/r/790699 (https://phabricator.wikimedia.org/T307142) [14:32:44] (03PS17) 10Brennen Bearnes: gitlab runner: restrict docker images and services [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) [14:34:39] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35185/console" [puppet] - 10https://gerrit.wikimedia.org/r/790699 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [14:35:28] (03CR) 10Brennen Bearnes: gitlab runner: restrict docker images and services (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [14:35:46] 10SRE, 10DC-Ops, 10Infrastructure-Foundations: private repo deployment - perccli implementation - https://phabricator.wikimedia.org/T308027 (10RobH) p:05Triage→03High [14:35:49] 
(03CR) 10Brennen Bearnes: gitlab runner: restrict docker images and services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [14:35:56] 10SRE, 10DC-Ops, 10Infrastructure-Foundations: private repo deployment - perccli implementation - https://phabricator.wikimedia.org/T308027 (10RobH) [14:36:00] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10RobH) [14:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:38:04] (03PS9) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 [14:38:21] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:38:43] (03CR) 10Jelto: [V: 03+1] "Can you take a look for a review? The diff is quite big because I had to suffix all of the rsync resource to make them unique. The diff al" [puppet] - 10https://gerrit.wikimedia.org/r/790699 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [14:44:33] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:49:01] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10akosiaris) >>! In T306181#7917390, @BTullis wrote: > > We have proposed creating a new bucke... 
[14:50:32] (03PS1) 10Thiemo Kreuz (WMDE): Fix incomplete FlaggedRevs::binaryFlagging() implementation [extensions/FlaggedRevs] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/790436 (https://phabricator.wikimedia.org/T307972) [14:51:01] (03PS1) 10Thiemo Kreuz (WMDE): Fix incomplete FlaggedRevs::binaryFlagging() implementation [extensions/FlaggedRevs] (wmf/1.39.0-wmf.11) - 10https://gerrit.wikimedia.org/r/790437 (https://phabricator.wikimedia.org/T307972) [14:51:45] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:54:29] (03PS1) 10Jbond: netbox: CableStatusChoices renamed to LinkStatusChoices [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705 [14:54:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] service::docker: allow use of 'latest' [puppet] - 10https://gerrit.wikimedia.org/r/789846 (owner: 10Giuseppe Lavagetto) [14:54:50] 10SRE, 10Data-Engineering, 10Data-Engineering-Kanban, 10Traffic, 10Patch-For-Review: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10Ottomata) > all that is needed is to deploy a version of eventgate On it. I had issues with... 
[14:54:58] (03PS1) 10Stang: wikidata: Remove 'mainpage' from wgForceUIMsgAsContentMsg [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790706 (https://phabricator.wikimedia.org/T184386) [14:56:06] (03CR) 10Stang: [C: 04-1] "wait" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790706 (https://phabricator.wikimedia.org/T184386) (owner: 10Stang) [14:58:25] (03CR) 10David Caro: rake_modules: add check for spdk licence header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [15:00:02] !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=codfw [15:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:33] (03CR) 10Muehlenhoff: OpenLDAP, move restart cronjob to systemd timer. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790614 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [15:00:49] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:01:16] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host chartmuseum2001.codfw.wmnet [15:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:45] RECOVERY - Check systemd state on ms-be1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:03:15] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum2001.codfw.wmnet [15:03:18] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:39] (03CR) 10Awight: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: 10Awight) [15:04:04] !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=codfw [15:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:56] (03PS2) 10Cathal Mooney: List IPs reserved for top-of-rack in cloud ceph hosts profile [puppet] - 10https://gerrit.wikimedia.org/r/790648 (https://phabricator.wikimedia.org/T304989) [15:05:35] !log jayme@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=helm-charts,name=eqiad [15:05:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:31] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-single for host chartmuseum1001.eqiad.wmnet [15:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:05] (03PS2) 10Jbond: netbox: Add fixes for netbox 3.1 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/790705 [15:08:15] (03CR) 10Muehlenhoff: "Similar comment as for https://gerrit.wikimedia.org/r/c/operations/puppet/+/786310, maybe we should rather add the tax universally to all " [puppet] - 10https://gerrit.wikimedia.org/r/789790 (owner: 10Jbond) [15:08:32] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host chartmuseum1001.eqiad.wmnet [15:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:06] !log jayme@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=helm-charts,name=eqiad [15:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:49] (03PS1) 10Zabe: filtered_tables: remove flaggedpage_config.fpc_select [puppet] - 10https://gerrit.wikimedia.org/r/790708 (https://phabricator.wikimedia.org/T262978) [15:12:05] (03CR) 10Thiemo 
Kreuz (WMDE): [C: 03+1] "I can confirm this. https://codesearch.wmcloud.org/search/?q=FlaggedRevsTags%5Cb. Also see I07cbcf2." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/790707 (https://phabricator.wikimedia.org/T277883) (owner: 10Awight) [15:12:19] (03PS2) 10Zabe: filtered_tables: remove flaggedpage_config.fpc_select [puppet] - 10https://gerrit.wikimedia.org/r/790708 (https://phabricator.wikimedia.org/T262978) [15:12:37] (03PS5) 10Btullis: Use both dbproxy101[89] servers for both wikireplica services [puppet] - 10https://gerrit.wikimedia.org/r/779915 (https://phabricator.wikimedia.org/T298940) [15:12:55] (03CR) 10Btullis: Use both dbproxy101[89] servers for both wikireplica services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/779915 (https://phabricator.wikimedia.org/T298940) (owner: 10Btullis) [15:23:03] (03PS6) 10Btullis: Use both dbproxy101[89] servers for both wikireplica services [puppet] - 10https://gerrit.wikimedia.org/r/779915 (https://phabricator.wikimedia.org/T298940) [15:24:11] (03PS7) 10Btullis: Use both dbproxy101[89] servers for both wikireplica services [puppet] - 10https://gerrit.wikimedia.org/r/779915 (https://phabricator.wikimedia.org/T298940) [15:26:50] (03PS1) 10Ottomata: eventgate-* - bump image to get 10s latency bucket metric [deployment-charts] - 10https://gerrit.wikimedia.org/r/790709 [15:27:36] (03PS1) 10Majavah: P:toolforge: remove linux kernel pinnings [puppet] - 10https://gerrit.wikimedia.org/r/790710 (https://phabricator.wikimedia.org/T290494) [15:29:14] (03PS2) 10Majavah: P:toolforge: remove linux kernel pinnings [puppet] - 10https://gerrit.wikimedia.org/r/790710 (https://phabricator.wikimedia.org/T290494) [15:30:02] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-* - bump image to get 10s latency bucket metric [deployment-charts] - 10https://gerrit.wikimedia.org/r/790709 (owner: 10Ottomata) [15:30:46] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply 
[15:30:48] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [15:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:31:05] (03PS15) 10Jbond: rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) [15:31:13] (03PS10) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 [15:32:34] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [15:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:10] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [15:33:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:52] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [15:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:17] !log rolling deploy/restart of all eventgate services to get 10s latency bucket metric - T306181 [15:35:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:21] (03CR) 10Andrew Bogott: [C: 03+1] "I'm sure there were reasons for splitting this out originally but it will make our lives easier to pair them so let's try it and see if it" [puppet] - 10https://gerrit.wikimedia.org/r/779915 (https://phabricator.wikimedia.org/T298940) (owner: 10Btullis) [15:35:22] T306181: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 [15:35:36] !log jayme@cumin1001 START - Cookbook sre.hosts.reboot-cluster [15:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:57] !log 
otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [15:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:56] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [15:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:20] (03PS1) 10Giuseppe Lavagetto: requestctl: add "find" command [software/conftool] - 10https://gerrit.wikimedia.org/r/790712 (https://phabricator.wikimedia.org/T305638) [15:39:35] (03CR) 10David Caro: [C: 03+1] P:openstack::encapi: add keystone token verification [puppet] - 10https://gerrit.wikimedia.org/r/785134 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [15:39:53] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [15:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:03] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [15:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:31] (03CR) 10jerkins-bot: [V: 04-1] requestctl: add "find" command [software/conftool] - 10https://gerrit.wikimedia.org/r/790712 (https://phabricator.wikimedia.org/T305638) (owner: 10Giuseppe Lavagetto) [15:41:37] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [15:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:57] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [15:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:08] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers for automated train pre-sync operations - https://phabricator.wikimedia.org/T303857 (10thcipriani) >>! 
In T303857#7916168, @Joe wrote: > Hey @thcipriani t... [15:43:55] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [15:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:59] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [15:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:40] (NodeTextfileStale) resolved: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:45:41] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (10Papaul) @Cmjohnson @Jclark-ctr I upload the junos-srxsme-20.1R1.11.tgz to apt.wikimedia.org under /srv/junos . if you have time this week can you please copy that i... 
[15:46:54] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [15:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:51] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [15:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:02] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [15:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:22] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [15:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:41] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [15:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:28] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [15:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:51] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:53:24] (03PS5) 10Hnowlan: New service: image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) [15:53:42] (03CR) 10Hnowlan: New service: image-suggestion (037 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) 
[15:53:52] (03PS11) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 [15:53:55] (03PS1) 10Jbond: apereo_cas: convert module to use SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/790716 (https://phabricator.wikimedia.org/T308013) [15:54:24] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [15:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:33] (03CR) 10Jbond: "I have re-done this commit as https://gerrit.wikimedia.org/r/c/operations/puppet/+/790716 so that it is in the dependency chain with the r" [puppet] - 10https://gerrit.wikimedia.org/r/790664 (owner: 10Jbond) [15:54:38] (03Abandoned) 10Jbond: apereo_cas: convert module to use SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/790664 (owner: 10Jbond) [15:55:00] (03CR) 10Jbond: rake_modules: add check for spdk licence header (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [15:55:06] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [15:55:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:17] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: convert module to use SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/790716 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond) [15:55:23] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [15:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:53] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [15:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:02] (03CR) 10jerkins-bot: [V: 04-1] New service: image-suggestion [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 
(https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [15:56:32] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [15:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:24] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [15:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:14] 10SRE, 10SRE-Access-Requests, 10Infrastructure-Foundations, 10serviceops, 10Release-Engineering-Team (Radar): Need a service account on deploy servers for automated train pre-sync operations - https://phabricator.wikimedia.org/T303857 (10jcrespo) If it helps- we have daily /srv backups of the deployment... [16:00:05] jbond and rzl: My dear minions, it's time we take the moon! Just kidding. Time for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220510T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:30] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (10Papaul) @Cmjohnson @Jclark-ctr the right image is junos-srxentedge-x86-64-20.4R3-S1.3.tgz and not junos-srxsme-20.1R1.11.tgz since codfw is using junos-srxentedge... 
[16:00:56] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [16:00:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:39] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [16:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:11] PROBLEM - SSH on furud.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:07:31] (03PS16) 10Jbond: rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) [16:08:32] (03CR) 10jerkins-bot: [V: 04-1] rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [16:09:16] (03CR) 10Jbond: rake_modules: add check for spdk licence header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [16:10:22] (03CR) 10David Caro: rake_modules: add check for spdk licence header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [16:11:14] https://logstash.wikimedia.org/ only shows me an empty dashboard instead of the usual home page, is that an intentional change? [16:11:25] (03PS12) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 [16:11:30] (apparently there’s a default filter for host:ores2003, last 15 minutes, which yields no results…) [16:11:46] (03PS2) 10Jbond: apereo_cas: convert module to use SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/790716 (https://phabricator.wikimedia.org/T308013) [16:11:52] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/790694 (https://phabricator.wikimedia.org/T307667) (owner: 10MVernon) [16:12:04] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: remove absented stats_account cron [puppet] - 10https://gerrit.wikimedia.org/r/778486 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [16:12:21] (03CR) 10jerkins-bot: [V: 04-1] rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 (owner: 10Jbond) [16:12:48] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: convert module to use SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/790716 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond) [16:12:58] (03CR) 10David Caro: [C: 03+1] rake_modules: add check for spdk licence header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [16:13:42] 10SRE-swift-storage, 10Patch-For-Review: Move swift crons to systemd timers - https://phabricator.wikimedia.org/T288806 (10Zabe) [16:13:54] Lucas_WMDE: someone must have edited accidentally the home dashboard [16:14:07] looks like it [16:14:09] Does anyone know if old versions are stored? [16:14:11] now the ores2003 filter is gone [16:14:14] I just edited it to remove the filter [16:14:18] I hope it wasn’t me >.< [16:14:20] ok [16:16:41] did the "home" dashboard normally have some filters on it or has it historically just been a raw feed? [16:16:49] "The dashboards and visualizations are saved inside the .kibana index on your Elasticsearch cluster. 
If you had backup done to it, you can recover them from there, but otherwise there is no way" [16:17:07] bd808: it was an actually nice overview of the main dashboards [16:17:15] yeah, it was quite useful [16:17:20] with links grouped by function [16:17:25] :(( [16:17:40] some backend dashboards, some frontend dashboards, some Wikibase ones [16:19:03] was briefly excited by this heading, but then the content didn't help -- https://wikitech.wikimedia.org/wiki/OpenSearch_Dashboards#Homepage [16:19:51] (03CR) 10Ayounsi: [V: 03+2 C: 03+2] Update submodule and requirements for 3.2.2 [software/netbox-deploy] (3-2-2) - 10https://gerrit.wikimedia.org/r/790693 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [16:21:26] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: Upgrade Netbox-dev2002 to 3.2 [16:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:32] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: Upgrade Netbox-dev2002 to 3.2 (duration: 02m 06s) [16:23:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:20] (03CR) 10Jelto: "Keep in mind that a config-template is used when the Runner is registered. So the change should take effect only after de-registering and " [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [16:25:29] jynus, Lucas_WMDE: I think I found the visualization that was the dashboard links... how does https://logstash.wikimedia.org/ look now? [16:26:02] !log ayounsi@deploy1002 Started deploy [netbox/deploy@7bbf659]: Upgrade Netbox-dev2002 to 3.2 [16:26:05] ah- you found the top template, right? [16:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:41] yeah, the panels get saved separately. 
[16:27:11] so that is probably the most important part [16:27:25] (03CR) 10RLazarus: [V: 03+1] deployment_server: Add keyholder identity for scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790455 (https://phabricator.wikimedia.org/T307351) (owner: 10RLazarus) [16:27:27] but I think there was some other dedicated ones below that [16:27:47] !log ayounsi@deploy1002 Finished deploy [netbox/deploy@7bbf659]: Upgrade Netbox-dev2002 to 3.2 (duration: 01m 45s) [16:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:27] bd808: that looks much closer to what I remembered, thank you <3 [16:28:42] the part below might have been different but I don’t care as much about that [16:28:47] as long as the dashboard links are there [16:29:07] I wonder if we have any recent screenshots of the home page [16:29:13] in some training material or whatever [16:31:50] looking at a slide deck Krinkle made a couple of years ago and the "- Home" there looks like just the markdown links panel filling the whole screen. [16:33:30] I think I am going to create a ticket- we should have backups of this and grafana dashboards (shouldn't take much space for both) [16:34:00] Imagine instead of 1 dashboard we lose all- too many person hours lost [16:35:40] 👍 [16:37:05] I took the log viewer off of the "- Home" default dashboard and made the markdown panel bigger. I'll stop piddling now. [16:37:41] T237224 [16:37:41] T237224: Backup kibana indices - https://phabricator.wikimedia.org/T237224 [16:38:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10nskaggs) I was wondering why some of them were spread the way they were (aka outside WMCS dedicated racks), but I see @cmooney updated the rackin... 
[16:39:38] the other issue is - whoever did it- not an issue if it was a mistake, but we should make it aware to prevent from happening again by mistake [16:42:56] (03PS1) 10David Caro: rabbitmq: add missing internal port [puppet] - 10https://gerrit.wikimedia.org/r/790725 [16:43:39] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [16:43:50] (03CR) 10Andrew Bogott: [C: 03+1] "I do not know how we were getting along without this -- the docs certainly imply that it's needed for clustering." [puppet] - 10https://gerrit.wikimedia.org/r/790725 (owner: 10David Caro) [16:44:00] agreed. no shaming obviously, but help not making the same mistake would be good [16:44:11] (03CR) 10David Caro: P:openstack::rabbitmq: cleanup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787003 (owner: 10Majavah) [16:44:28] as my guess someone could be thinking that "saving the home page" would be their personal home :-) [16:44:32] (03CR) 10David Caro: P:openstack::rabbitmq: cleanup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/787003 (owner: 10Majavah) [16:44:33] * bd808 tries to fail in a brand new way each and every day [16:44:44] (like phabricator works) [16:45:24] (03CR) 10David Caro: [C: 03+2] rabbitmq: add missing internal port [puppet] - 10https://gerrit.wikimedia.org/r/790725 (owner: 10David Caro) [16:46:55] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:55:23] 10SRE, 10LDAP-Access-Requests, 10User-zeljkofilipin: +2 for esther-akinloose in Gerrit (mediawiki/extensions/VisualEditor) - https://phabricator.wikimedia.org/T305373 (10RLazarus) > I think she already has a wikitech account: 
https://wikitech.wikimedia.org/wiki/User:Esther_Akinloose Oh, yep! It just didn't... [16:59:17] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Upgrade pfw to Junos 20+ - https://phabricator.wikimedia.org/T295691 (10Papaul) @Jgreen hello. I am planning on doing this on the 16th at 10:00am CT . let me know it time works for you. Thanks [17:01:55] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [17:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:46] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [17:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:53] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [17:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:02] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [17:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:05:07] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [17:05:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:00] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [17:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:05] (03CR) 10DannyS712: [C: 03+1] filtered_tables: remove flaggedpage_config.fpc_select [puppet] - 10https://gerrit.wikimedia.org/r/790708 (https://phabricator.wikimedia.org/T262978) (owner: 10Zabe) [17:08:02] (03CR) 10Jgiannelos: [C: 03+2] proton: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/787506 (owner: 10Jgiannelos) [17:12:31] (03Merged) 10jenkins-bot: proton: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/787506 (owner: 10Jgiannelos) 
[17:13:58] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [17:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:49] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply [17:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:05] (03CR) 10Brennen Bearnes: gitlab runner: restrict docker images and services (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [17:17:25] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply [17:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:06] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply [17:19:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:48] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply [17:19:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:54] (03PS18) 10Brennen Bearnes: gitlab runner: restrict docker images and services [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) [17:20:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10cmooney) @nskaggs yes the ones that require the Public Vlan are probably actually placed not in those dedicated WMCS racks, to leave as much room... 
[17:22:10] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [17:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:21] (03PS1) 10Btullis: Double the number of replicas of eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/790727 (https://phabricator.wikimedia.org/T306181) [17:39:09] (03CR) 10Stang: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/776334 (https://phabricator.wikimedia.org/T305320) (owner: 10NguoiDungKhongDinhDanh) [17:47:02] !log people2002 - reboot incoming [17:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:47] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:04:03] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Patch-For-Review: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10Cmjohnson) Assigned a case number and submitted AHS log. 
Successfully Submitted Case Number: 5364283545 [18:04:40] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10Patch-For-Review: Power drain and restart of ms-be1059 - https://phabricator.wikimedia.org/T307667 (10MatthewVernon) Thanks for the update :) [18:08:10] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for RoccoMo - https://phabricator.wikimedia.org/T308053 (10Isaac) [18:08:34] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1142.eqiad.wmnet with OS buster [18:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster [18:08:48] (03PS3) 10BCornwall: Add Kwaku Addo Ofori to ops manager approval list [puppet] - 10https://gerrit.wikimedia.org/r/790393 [18:09:25] (03CR) 10Ottomata: [C: 03+1] Double the number of replicas of eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/790727 (https://phabricator.wikimedia.org/T306181) (owner: 10Btullis) [18:09:48] (03CR) 10BCornwall: [C: 03+2] Add Kwaku Addo Ofori to ops manager approval list [puppet] - 10https://gerrit.wikimedia.org/r/790393 (owner: 10BCornwall) [18:10:48] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for RoccoMo - https://phabricator.wikimedia.org/T308053 (10Isaac) Hey SRE/Analytics -- we have a new formal collaborator onboard: @RoccoMo. They need access to HDFS and the stat machines for a new research project. Don't hesitate to... 
[18:16:56] 10SRE, 10RESTBase-API, 10Traffic: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (10BBlack) The limit you're hitting is an intentional one, from this block in our edge code: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/... [18:17:45] (03PS1) 10Majavah: kubeadm: add support rebooting the nodes when upgrading [puppet] - 10https://gerrit.wikimedia.org/r/790735 [18:22:43] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for RoccoMo - https://phabricator.wikimedia.org/T308053 (10RoccoMo) [18:25:18] (03CR) 10Stang: [C: 04-1] "Typically this file is updated via command (instead of manually). An example could be found at Iad6e5de551a1db4aede70a41e678b7c1ab44449c" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623094 (owner: 10Keenan Pepper) [18:25:55] (03CR) 10Stang: [C: 04-1] "Also, for such non-trivial modification, please create a task on Phabricator first." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623094 (owner: 10Keenan Pepper) [18:28:00] 10SRE, 10RESTBase-API, 10Traffic: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (10Mitar) Hm, but documentation for REST API says I can use 200 requests per second? https://en.wikipedia.org/api/rest_v1/ > Limit your clients to no more than 200 requests/s to this... [18:28:57] 10SRE, 10RESTBase-API, 10Traffic: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (10Mitar) Sadly bulk downloads do not have HTML dumps, and Enterprise dumps do not offer them for template/module documentation (only articles, categories, and files). Also, there are... 
[18:30:35] (03CR) 10Jbond: rake_modules: add check for spdk licence header (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) (owner: 10Jbond) [18:31:24] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for RoccoMo - https://phabricator.wikimedia.org/T308053 (10RLazarus) @RoccoMo Hi from the SRE team! Thanks for the request, we'll get you sorted out shortly. @Ottomata Can you approve for Analytics please? @KFrancis I understand t... [18:32:28] (03CR) 10Stang: [C: 04-1] Add cubic hectometre conversion [mediawiki-config] - 10https://gerrit.wikimedia.org/r/623094 (owner: 10Keenan Pepper) [18:34:23] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for RoccoMo - https://phabricator.wikimedia.org/T308053 (10Isaac) [18:36:46] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1142.eqiad.wmnet with OS buster [18:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q2:(Need By: TBD) rack/setup/install an-worker11[42-48].eqiad.wmnet - https://phabricator.wikimedia.org/T293922 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-worker1142.eqiad.wmnet with OS buster exec... 
[18:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:39:59] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:40:46] (03PS17) 10Jbond: rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) [18:52:41] 10SRE, 10RESTBase-API, 10Traffic: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (10BBlack) >>! In T307610#7918538, @Mitar wrote: > Hm, but documentation for REST API says I can use 200 requests per second? https://en.wikipedia.org/api/rest_v1/ > >> Limit your cli... [18:54:57] 10SRE, 10RESTBase-API, 10Traffic: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (10Mitar) Hm, it seems that [comments are out of sync](https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/varnish/templates/text-frontend.... [19:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:02:36] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for RoccoMo - https://phabricator.wikimedia.org/T308053 (10KFrancis) @RLazarus I am confirming we have a signed NDA for Mo. Thanks! 
[19:05:43] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:08:19] (PS11) Bking: elasticsearch: Java version is a fact, does not need to be a param [puppet] - https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) [19:09:22] (CR) jerkins-bot: [V: -1] elasticsearch: Java version is a fact, does not need to be a param [puppet] - https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: Bking) [19:15:21] (CR) Krinkle: [C: +1] Disable LocalisationUpdate, part II [mediawiki-config] - https://gerrit.wikimedia.org/r/677326 (https://phabricator.wikimedia.org/T158360) (owner: Jforrester) [19:15:28] (CR) Krinkle: [C: +1] Disable LocalisationUpdate, part III [mediawiki-config] - https://gerrit.wikimedia.org/r/677327 (https://phabricator.wikimedia.org/T158360) (owner: Jforrester) [19:16:33] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [19:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:04] SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (Dzahn) [19:18:10] SRE, WMF-General-or-Unknown, WMF-Legal, Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (Dzahn) [19:18:49] (PS13) Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - https://gerrit.wikimedia.org/r/789790 [19:18:51] (PS3) Jbond: apereo_cas: convert module to use SPDX headers [puppet] - https://gerrit.wikimedia.org/r/790716 (https://phabricator.wikimedia.org/T308013) [19:19:48] (CR) jerkins-bot: [V: -1] rake: Add new rake task to convert a module to SPDX [puppet] - https://gerrit.wikimedia.org/r/789790 (owner: Jbond) [19:20:13] (CR) jerkins-bot: [V: -1] 
apereo_cas: convert module to use SPDX headers [puppet] - https://gerrit.wikimedia.org/r/790716 (https://phabricator.wikimedia.org/T308013) (owner: Jbond) [19:20:16] (PS1) Dzahn: gitlab: license module files with SPDX-License-Identifier: Apache-2.0 [puppet] - https://gerrit.wikimedia.org/r/790743 (https://phabricator.wikimedia.org/T308013) [19:20:31] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:07] (CR) jerkins-bot: [V: -1] gitlab: license module files with SPDX-License-Identifier: Apache-2.0 [puppet] - https://gerrit.wikimedia.org/r/790743 (https://phabricator.wikimedia.org/T308013) (owner: Dzahn) [19:21:32] (PS4) Jbond: apereo_cas: convert module to use SPDX headers [puppet] - https://gerrit.wikimedia.org/r/790716 (https://phabricator.wikimedia.org/T308013) [19:21:44] (PS12) Bking: elasticsearch: Java version is a fact, does not need to be a param [puppet] - https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) [19:22:13] (CR) Dzahn: apereo_cas: convert module to use SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/790664 (owner: Jbond) [19:22:39] (CR) jerkins-bot: [V: -1] apereo_cas: convert module to use SPDX headers [puppet] - https://gerrit.wikimedia.org/r/790716 (https://phabricator.wikimedia.org/T308013) (owner: Jbond) [19:23:01] (CR) Dzahn: apereo_cas: convert module to use SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/790664 (owner: Jbond) [19:23:11] (CR) jerkins-bot: [V: -1] elasticsearch: Java version is a fact, does not need to be a param [puppet] - https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: Bking) [19:24:08] (CR) Dzahn: "jerkins does not like us adding license comments on the first line. " The shebang must be on the first line. 
Delete blanks and move commen" [puppet] - 10https://gerrit.wikimedia.org/r/790664 (owner: 10Jbond) [19:24:39] (03PS1) 10Cathal Mooney: Add 'includes' in private address reverse zones for new subnets [dns] - 10https://gerrit.wikimedia.org/r/790744 (https://phabricator.wikimedia.org/T304989) [19:24:48] (03PS13) 10Bking: elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) [19:25:19] (03PS2) 10Cathal Mooney: Add 'includes' in private address reverse zones for new subnets [dns] - 10https://gerrit.wikimedia.org/r/790744 (https://phabricator.wikimedia.org/T304989) [19:26:07] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [19:26:56] (03CR) 10Jbond: apereo_cas: convert module to use SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790664 (owner: 10Jbond) [19:27:03] mutante: fyi ^^ [19:27:59] (03PS14) 10Bking: elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) [19:29:07] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: Java version is a fact, does not need to be a param [puppet] - 10https://gerrit.wikimedia.org/r/789644 (https://phabricator.wikimedia.org/T289135) (owner: 10Bking) [19:29:25] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:30:27] (03PS14) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 [19:32:52] 10SRE, 10RESTBase-API, 10Traffic: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (10Mitar) > Because our edge traffic code enforces a 
stricter limit of ~100/s (for responses that aren't frontend cache hits due to popularity), before the requests ever get to the Res... [19:33:36] jbond: ah, ACK:) I was trying it here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/790743 can't add the comment line before the shebang lines [19:35:04] !log mforns@deploy1002 Started deploy [analytics/refinery@d2dfced]: Regular analytics weekly train [analytics/refinery@d2dfced] [19:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:05] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:38:27] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1005.wikimedia.org [19:38:29] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudcontrol1005.wikimedia.org [19:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:44] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1005.wikimedia.org [19:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:56] (03CR) 10Ssingh: [C: 03+1] Add 'includes' in private address reverse zones for new subnets [dns] - 10https://gerrit.wikimedia.org/r/790744 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [19:44:29] 10SRE, 10LDAP-Access-Requests, 10User-zeljkofilipin: +2 for esther-akinloose in Gerrit (mediawiki/extensions/VisualEditor) - https://phabricator.wikimedia.org/T305373 (10RLazarus) After consulting with SRE colleagues, I stand corrected -- the email address on the account is fine, and we'll just use the wikim... 
[19:45:04] (PS1) RLazarus: admin: Add esther-akinloose to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/790748 [19:45:44] (PS2) RLazarus: admin: Add esther-akinloose to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/790748 (https://phabricator.wikimedia.org/T305373) [19:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:47:19] (CR) Ssingh: [C: +1] admin: Add esther-akinloose to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/790748 (https://phabricator.wikimedia.org/T305373) (owner: RLazarus) [19:48:01] (PS18) Jbond: rake_modules: add check for spdk licence header [puppet] - https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) [19:48:42] (PS1) Jbond: rake: test spdx::check:new_files CI check [puppet] - https://gerrit.wikimedia.org/r/790749 [19:49:01] (CR) RLazarus: [C: +2] admin: Add esther-akinloose to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/790748 (https://phabricator.wikimedia.org/T305373) (owner: RLazarus) [19:49:07] (PS3) RLazarus: admin: Add esther-akinloose to ldap_only_users [puppet] - https://gerrit.wikimedia.org/r/790748 (https://phabricator.wikimedia.org/T305373) [19:49:12] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1004.wikimedia.org [19:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:40] (CR) jerkins-bot: [V: -1] rake: test spdx::check:new_files CI check [puppet] - https://gerrit.wikimedia.org/r/790749 (owner: Jbond) [19:50:55] (CR) Jbond: rake_modules: add check for spdk licence header (1 comment) [puppet] - https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) 
(owner: 10Jbond) [19:51:24] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol1005.wikimedia.org [19:51:24] brett: okay to merge yours? [19:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:31] (03CR) 10Krinkle: gitlab runner: restrict docker images and services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [19:52:09] rzl: Do I have a change still open? [19:52:23] brett: yeah, https://gerrit.wikimedia.org/r/c/operations/puppet/+/790393/ is submitted but not merged on the puppetmaster yet :) [19:52:47] rzl: D'oh! Yes, please merge :) [19:52:51] doing, thanks! [19:54:31] !log mforns@deploy1002 Finished deploy [analytics/refinery@d2dfced]: Regular analytics weekly train [analytics/refinery@d2dfced] (duration: 19m 26s) [19:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:54] (03CR) 10Dzahn: [C: 03+1] "looks good to me! 
https://puppet-compiler.wmflabs.org/pcc-worker1002/35186/gitlab1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/790699 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [19:54:56] (03PS19) 10Jbond: rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) [19:55:00] (03PS2) 10Peter Bowman: Add localized wordmark for plwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/789613 (https://phabricator.wikimedia.org/T307683) [19:55:48] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1003.wikimedia.org [19:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:49] (03PS20) 10Jbond: rake_modules: add check for spdk licence header [puppet] - 10https://gerrit.wikimedia.org/r/786310 (https://phabricator.wikimedia.org/T67270) [20:00:04] RoanKattouw, Urbanecm, and cjming: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220510T2000). [20:00:05] koi: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:07] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for RoccoMo - https://phabricator.wikimedia.org/T308053 (10Ottomata) Approved! This will also need kerberos access and LDAP `nda` group membership. [20:00:29] hey [20:00:47] koi: hey, around? [20:00:54] yeah, here [20:01:09] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for RoccoMo - https://phabricator.wikimedia.org/T308053 (10RLazarus) [20:01:19] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for RoccoMo - https://phabricator.wikimedia.org/T308053 (10RLazarus) Thanks both! 
Proceeding. [20:01:25] apologize that it's a big patch [20:01:37] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review, 10User-zeljkofilipin: +2 for esther-akinloose in Gerrit (mediawiki/extensions/VisualEditor) - https://phabricator.wikimedia.org/T305373 (10RLazarus) 05Open→03Resolved a:03RLazarus Done! ` rzl@mwmaint1002:~$ ldapsearch -x cn=wmf | grep esther-akinloo... [20:02:27] looking [20:03:10] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol1004.wikimedia.org [20:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:50] (03PS15) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 [20:04:35] (03PS2) 10Jbond: rake: test spdx::check:new_files CI check [puppet] - 10https://gerrit.wikimedia.org/r/790749 [20:04:42] rzl: this will require a +1/discussion in advance, as it introduces a db list (which slows things down a bit, and is usually avoided whenever possible. plus, it's indeed a large patch and those always benefit from more eyes :)) [20:04:48] (03CR) 10Peter Bowman: "Requesting review per similar task 620513 (localized wordmark for eswiktionary)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/789613 (https://phabricator.wikimedia.org/T307683) (owner: 10Peter Bowman) [20:04:52] (03PS5) 10Jbond: apereo_cas: convert module to use SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/790716 (https://phabricator.wikimedia.org/T308013) [20:05:11] !log mforns@deploy1002 Started deploy [analytics/refinery@d2dfced] (thin): Regular analytics weekly train THIN [analytics/refinery@d2dfced] [20:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:18] !log mforns@deploy1002 Finished deploy [analytics/refinery@d2dfced] (thin): Regular analytics weekly train THIN [analytics/refinery@d2dfced] (duration: 00m 07s) [20:05:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:41] (03CR) 10jerkins-bot: [V: 04-1] rake: test spdx::check:new_files CI check [puppet] - 10https://gerrit.wikimedia.org/r/790749 (owner: 10Jbond) [20:05:46] meh [20:05:52] koi: see above [20:06:25] ack, waiting for +1 [20:06:31] (03CR) 10jerkins-bot: [V: 04-1] apereo_cas: convert module to use SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/790716 (https://phabricator.wikimedia.org/T308013) (owner: 10Jbond) [20:06:42] !log mforns@deploy1002 Started deploy [analytics/refinery@d2dfced] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d2dfced] [20:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:09] koi: i meant you'd need to reschedule (it's unlikely one will come in the few minutes we've for the B&C window) [20:07:24] 10SRE, 10WMF-General-or-Unknown, 10WMF-Legal, 10Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (10RhinosF1) I hereby license all my current and future contributions to the operations/puppet under the Apache 2.0 license. 
[20:08:34] (03CR) 10Jbond: rake: Add new rake task to convert a module to SPDX (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789790 (owner: 10Jbond) [20:08:50] well, yeah.. I would wait for some minutes and if no response I would reschedule it [20:09:02] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol1003.wikimedia.org [20:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:30] urbanecm would you mind adding someone relevant to reviewer section? [20:10:25] i'll try [20:13:41] !log mforns@deploy1002 Finished deploy [analytics/refinery@d2dfced] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@d2dfced] (duration: 06m 59s) [20:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:51] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 10 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [20:14:45] leaked VMs? hmm [20:15:03] reads runbook [20:16:06] (03PS16) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 [20:19:41] andrewbogott: I read the runbook for that cloudcontrol1003 alert. there are 3 failed instances. But I am not sure if those are old or new. 
[20:20:37] PROBLEM - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is CRITICAL: 11 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [20:21:40] 10SRE, 10Wikimedia-Site-requests, 10Chinese-Sites, 10Patch-For-Review: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991 (10Stang) [20:23:04] (03PS1) 10RLazarus: admin: Add mhoutti to analytics-privatedata-users with ssh and kerberos [puppet] - 10https://gerrit.wikimedia.org/r/790754 (https://phabricator.wikimedia.org/T308053) [20:23:06] (03PS17) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 [20:23:37] Hey urbanecm, sorry to bother but I have another question about a patch - do you think it is ok to be deployed soon? [20:23:39] https://gerrit.wikimedia.org/r/c/785229 [20:24:29] koi: there's a -1, so definitely not "right away" :) [20:25:00] I know nothing about this but that -1 is not reflecting reality anymore. [20:25:04] task has been closed [20:25:18] should ask the reviewer to consider removing it though [20:25:29] (03PS18) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 [20:25:32] true. I still recommend checking with Amir (who gave it originally) and scheduling after that happens :) [20:25:37] this [20:26:47] (03CR) 10Stang: Enable "upload_by_url" feature on zhwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/785229 (https://phabricator.wikimedia.org/T142991) (owner: 10Stang) [20:26:53] nice point, message sent [20:27:20] (y) [20:28:33] hmm, maybe one more question - what do you think of T171140 [20:28:35] T171140: Enable Wikidata support for Outreach Wiki - https://phabricator.wikimedia.org/T171140 [20:28:52] koi: that's true generally speaking btw. 
I can imagine cases when a patch gets deployed even though there's a -1, but that's really the exception, definitely not the rule [20:29:13] koi: do you have any specific question about that one? [20:29:28] (03PS19) 10Jbond: rake: Add new rake task to convert a module to SPDX [puppet] - 10https://gerrit.wikimedia.org/r/789790 [20:29:53] just wondering what "knowledge about the project's internals" means [20:29:59] aren't "SWAT window" and "needs an expert" kind of mutually exclusive? [20:30:29] yes [20:31:00] (03PS6) 10Jbond: apereo_cas: convert module to use SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/790716 (https://phabricator.wikimedia.org/T308013) [20:32:51] (03CR) 10Dzahn: [C: 03+1] "looks good to me. chatted with Isaac a bit about this" [puppet] - 10https://gerrit.wikimedia.org/r/790754 (https://phabricator.wikimedia.org/T308053) (owner: 10RLazarus) [20:33:49] (03CR) 10RLazarus: [C: 03+2] admin: Add mhoutti to analytics-privatedata-users with ssh and kerberos [puppet] - 10https://gerrit.wikimedia.org/r/790754 (https://phabricator.wikimedia.org/T308053) (owner: 10RLazarus) [20:36:31] koi: in this case, it means someone who knows how Wikidata's integration with its clients works internally. as you're aware, WD's integration with client sites is a feature we heavily rely on. any time a feature is touched, we risk it breaking terribly.
having a feature broken can have all sorts of different effects: from "site down" over "users can't edit" and "deletion doesn't work" to "the sidebar disappeared" [20:36:49] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:37:12] if WD client site integration stops working, a lot of other things goes down instantly (for example, any page that uses Wikidata suddenly can't parse) [20:38:44] so, it should be touched by expert, to decrease risk of the feature breaking [20:39:16] koi: does that make sense? [20:39:31] yeah, sense making, totally agreed with this point [20:39:58] so do I need to, um, kind of abandon my patch and let someone else rewrite it? [20:40:45] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for RoccoMo - https://phabricator.wikimedia.org/T308053 (10RLazarus) 05Open→03Resolved a:03RLazarus Added to ldap/nda: ` rzl@mwmaint1002:~$ ldapsearch -x cn=nda | grep mhoutti member: uid=mhoutti,ou=peop... [20:41:13] not necessarily. by "it should be touched by an expert", I meant that an expert will need to do the change (by deploying the patches). that's what can easily get tricky [20:42:51] they could also amend to your existing change if needed [20:43:00] ack, and thanks for the explanation [20:43:35] so to be short, waiting for someone or a pm to handle such stuff [20:43:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [20:44:53] that's two things. 
Lydia (WD's PM=Product Manager) might say "we don't want WD support to be enabled in this wiki, because XYZ", and that'd mean we can't do it, even if an expert is here [20:45:25] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:45:32] she might say "it's a good idea, and the team will work on it" (which means an engineer from the team will be assigned to it and will work on it) [20:45:53] or she might say "it's a good idea, but the team doesn't have the capacity. if someone wants to work on it, feel free" [20:46:56] koi: so it's more of an "and" rather than an "or" in your summary [20:47:31] aha, got it, two approvals required [20:48:27] approval from a PM and then someone implementing it (either from the team, or outside, if they have relevant experience) [20:56:45] (03CR) 10JHathaway: [C: 03+1] "one question, but looks good overall" [puppet] - 10https://gerrit.wikimedia.org/r/789790 (owner: 10Jbond) [20:57:08] 10SRE, 10SRE-swift-storage, 10Commons: Server error 0 after uploading chunk - https://phabricator.wikimedia.org/T307874 (10Yann) Another error while using the same script with the same type of files: `01784: 28/28> in progress Upload: 38% 01814: 28/28> Server error 503 after uploading chunk: Service Unavail... [21:08:11] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:10:01] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python3, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:12:48] (03CR) 10CDanis: "This looks OK but I'll give it another pass my morning tomorrow."
[software/conftool] - 10https://gerrit.wikimedia.org/r/789154 (https://phabricator.wikimedia.org/T305607) (owner: 10Giuseppe Lavagetto) [21:23:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Jclark-ctr) cloudrabbit1001 B2 U24 Cableid 5005 Port 29 cloudrabbit1002 C4 U5 Cableid 20220300... [21:23:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [21:30:31] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10RhinosF1) [21:31:45] PROBLEM - SSH on labweb1002.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:31:47] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:32:51] RECOVERY - Check for VMs leaked by the nova-fullstack test on cloudcontrol1003 is OK: 0 instances in the admin-monitoring project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_VMs_leaked_by_the_nova-fullstack_test [21:39:07] mutante: the problem turned out to be more complicated than the cookbook would've known. thanks for looking! [21:39:54] andrewbogott: ok, just saw the recovery. 
thanks for the follow-up [21:43:18] (03PS1) 10Zabe: swift: migrate container stats cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/790761 (https://phabricator.wikimedia.org/T273673) [21:43:22] (03PS1) 10Zabe: swift: remove absented container stats cron [puppet] - 10https://gerrit.wikimedia.org/r/790762 (https://phabricator.wikimedia.org/T273673) [21:46:09] (03CR) 10Zabe: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/790761 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [21:47:49] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Legoktm) [21:50:01] (03CR) 10Zabe: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/1326/" [puppet] - 10https://gerrit.wikimedia.org/r/790761 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [22:03:27] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 12.68 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [22:03:47] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 27.39 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [22:04:37] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is CRITICAL: 41.35 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [22:06:49] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on alert1001 is OK: (C)60 le (W)70 le 86.53 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [22:07:53] 
RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [22:08:13] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 81.67 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [22:32:57] RECOVERY - SSH on labweb1002.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:37:56] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:38:41] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:05:35] (03PS1) 10Dwisehaupt: Turn on monitoring for new frack hosts [puppet] - 10https://gerrit.wikimedia.org/r/790773 [23:20:11] (03PS3) 10BryanDavis: striker: Add profile to provision docker container [puppet] - 10https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) [23:31:15] (03CR) 10BryanDavis: [C: 04-1] "Needs hiera settings for the cloudweb2002-dev.wikimedia.org staging environment as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/790012 (https://phabricator.wikimedia.org/T306469) (owner: 10BryanDavis) [23:37:43] PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:39:15] PROBLEM - MariaDB Replica IO: s5 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2123.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:39:59] RECOVERY - MariaDB Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:41:31] RECOVERY - MariaDB Replica IO: s5 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:46:55] (NodeTextfileStale) firing: Stale textfile for cloudvirt1019:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:57:20] (03PS1) 10Brennen Bearnes: WIP: GitLab: enable container registry (experimental) [puppet] - 10https://gerrit.wikimedia.org/r/790778 (https://phabricator.wikimedia.org/T307537)