[00:39:00] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:54:24] PROBLEM - Host an-tool1005 is DOWN: PING CRITICAL - Packet loss = 100%
[01:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220517T0100)
[01:05:22] PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-config-backup-gitlab1003.wikimedia.org.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:06:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2022:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:35:36] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:39:17] SRE-OnFire, SRE Observability (FY2021/2022-Q4): implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (lmata) p:Triage→Medium
[01:43:45] (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:53:45] (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:05:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:05:58] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:50] (PS1) TrainBranchBot: Branch commit for wmf/1.39.0-wmf.12 [core] (wmf/1.39.0-wmf.12) - https://gerrit.wikimedia.org/r/792302
[02:07:54] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.39.0-wmf.12 [core] (wmf/1.39.0-wmf.12) - https://gerrit.wikimedia.org/r/792302 (owner: TrainBranchBot)
[02:22:59] (Merged) jenkins-bot: Branch commit for wmf/1.39.0-wmf.12 [core] (wmf/1.39.0-wmf.12) - https://gerrit.wikimedia.org/r/792302 (owner: TrainBranchBot)
[02:27:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:28:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:24:28] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:26:42] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.075 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:37:08] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:37:42] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:46:28] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:04:43] (PS3) KartikMistry: Enable Section Translation in bcl, is, ne, pa, ts and ur Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/791481 (https://phabricator.wikimedia.org/T304828)
[04:24:22] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:32:08] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:34:04] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:38:54] RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:06:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2022:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:20:04] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:23:00] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:23:20] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:44:56] <_joe_> !log restarted rsyslog on kubernetes2022
[05:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:58] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2022:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:54:00] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:04] kormat, marostegui, and Amir1: Your horoscope predicts another unfortunate Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220517T0600).
[06:12:33] SRE, Infrastructure-Foundations, Prod-Kubernetes, netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (JMeybohm) >>! In T306649#7931058, @akosiaris wrote: >> Regarding the "fake nodes": I think that could be done with adding the le...
[06:23:38] SRE, SRE-OnFire, conftool, Sustainability (Incident Followup): Invalid confctl selector should either error out or select nothing - https://phabricator.wikimedia.org/T308100 (Joe) Open→Resolved p:Triage→High
[06:25:43] SRE, conftool: requestctl v1 improvements - https://phabricator.wikimedia.org/T305580 (Joe)
[06:25:46] SRE, conftool, Patch-For-Review: Provide a meaningful Retry-After value - https://phabricator.wikimedia.org/T305824 (Joe) Open→Resolved
[06:29:04] SRE, Infrastructure-Foundations, Prod-Kubernetes, netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (ayounsi) > Plus, they are VMs and we have the same problem we have with the kask dedicated nodes (also VMs). Netbox doesn't have...
[06:33:47] (CR) Ayounsi: [C: +2] msw: use _get_junos_interfaces [homer/public] - https://gerrit.wikimedia.org/r/792204 (owner: Ayounsi)
[06:34:35] (Merged) jenkins-bot: msw: use _get_junos_interfaces [homer/public] - https://gerrit.wikimedia.org/r/792204 (owner: Ayounsi)
[06:37:57] !log management switches, split configuration per interfaces (use new get_junos_interfaces function)
[06:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:26] (PS4) Slyngshede: Move l10nupdate to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673)
[06:41:47] (PS5) Slyngshede: Move l10nupdate to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673)
[06:42:06] (CR) Ayounsi: [C: +2] mr: use _get_junos_interfaces [homer/public] - https://gerrit.wikimedia.org/r/792205 (owner: Ayounsi)
[06:42:29] (CR) jerkins-bot: [V: -1] Move l10nupdate to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[06:42:40] (Merged) jenkins-bot: mr: use _get_junos_interfaces [homer/public] - https://gerrit.wikimedia.org/r/792205 (owner: Ayounsi)
[06:44:26] (PS6) Slyngshede: Move l10nupdate to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673)
[06:49:22] !log management routers, split configuration per interfaces (use new get_junos_interfaces function)
[06:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:17] (CR) Filippo Giunchedi: [C: +2] prometheus: remove http availability pages, moved to prometheus (1 comment) [puppet] - https://gerrit.wikimedia.org/r/790671 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[06:52:57] (PS2) WMDE-Fisch: Deploy VE template dialog improvements to enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/791314 (https://phabricator.wikimedia.org/T306967)
[06:53:04] (PS2) WMDE-Fisch: Deploy template search improvements to enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/791315 (https://phabricator.wikimedia.org/T303802)
[06:56:46] (PS7) Slyngshede: Move l10nupdate to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673)
[06:59:46] (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35301/console" [puppet] - https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:00:05] Amir1 and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220517T0700).
[07:00:05] WMDE-Fisch and kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:21] * kart_ is here
[07:00:23] \o I can self serve
[07:00:43] * WMDE-Fisch starts
[07:00:43] Cool. Please go ahead and let me know once done.
[07:00:51] * urbanecm waves too
[07:00:59] But leaves WMDE-Fisch to self serve :))
[07:01:56] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:02:06] (CR) WMDE-Fisch: [C: +2] "Deploy!" [mediawiki-config] - https://gerrit.wikimedia.org/r/791314 (https://phabricator.wikimedia.org/T306967) (owner: WMDE-Fisch)
[07:03:08] (Merged) jenkins-bot: Deploy VE template dialog improvements to enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/791314 (https://phabricator.wikimedia.org/T306967) (owner: WMDE-Fisch)
[07:03:30] (CR) Slyngshede: [V: +1] "Fixed comments." [puppet] - https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:04:40] * WMDE-Fisch testing first patch on debug1001
[07:06:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:07:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:46] !log wmde-fisch@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:791314|Deploy VE template dialog improvements to enwiki (T306967)]] (duration: 00m 50s)
[07:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:50] T306967: Deploy VE template dialog improvements to enwiki - https://phabricator.wikimedia.org/T306967
[07:08:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:08:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:32] * WMDE-Fisch 1st patch seems fine... moving on
[07:11:48] (PS1) Giuseppe Lavagetto: requestctl: fix small issues with the VSL translations [software/conftool] - https://gerrit.wikimedia.org/r/792555
[07:11:50] (PS1) Giuseppe Lavagetto: New version [software/conftool] - https://gerrit.wikimedia.org/r/792556
[07:11:52] (CR) WMDE-Fisch: [C: +2] "Deploy!" [mediawiki-config] - https://gerrit.wikimedia.org/r/791315 (https://phabricator.wikimedia.org/T303802) (owner: WMDE-Fisch)
[07:12:38] (Merged) jenkins-bot: Deploy template search improvements to enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/791315 (https://phabricator.wikimedia.org/T303802) (owner: WMDE-Fisch)
[07:12:50] (CR) Ayounsi: [C: +2] cr: use _get_junos_interfaces [homer/public] - https://gerrit.wikimedia.org/r/792208 (owner: Ayounsi)
[07:13:28] (Merged) jenkins-bot: cr: use _get_junos_interfaces [homer/public] - https://gerrit.wikimedia.org/r/792208 (owner: Ayounsi)
[07:14:07] (CR) JMeybohm: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/792232 (https://phabricator.wikimedia.org/T308418) (owner: Elukey)
[07:14:23] Testing on debug1001
[07:16:33] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/792284 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:17:22] !log core routers, split configuration per interfaces (use new get_junos_interfaces function)
[07:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:19:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:18] (PS1) Jaime Nuche: testwikis wikis to 1.39.0-wmf.12 refs T305218 [mediawiki-config] - https://gerrit.wikimedia.org/r/792557
[07:20:20] (CR) Jaime Nuche: [C: +2] testwikis wikis to 1.39.0-wmf.12 refs T305218 [mediawiki-config] - https://gerrit.wikimedia.org/r/792557 (owner: Jaime Nuche)
[07:20:31] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:20:33] (CR) Muehlenhoff: [C: +1] "Looks good, all commits with a @wikimedia.org address, I'll merge" [puppet] - https://gerrit.wikimedia.org/r/792282 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:20:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:20:46] !log wmde-fisch@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:791315|Deploy template search improvements to enwiki (T303802)]] (duration: 02m 11s)
[07:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:51] (PS6) Slyngshede: Move Carbon Cache log cleanup to systemd tmpfile. [puppet] - https://gerrit.wikimedia.org/r/792155 (https://phabricator.wikimedia.org/T273673)
[07:20:52] T303802: Deploy template search improvements to enwiki - https://phabricator.wikimedia.org/T303802
[07:20:59] synced, final tests
[07:21:08] (Merged) jenkins-bot: testwikis wikis to 1.39.0-wmf.12 refs T305218 [mediawiki-config] - https://gerrit.wikimedia.org/r/792557 (owner: Jaime Nuche)
[07:21:10] (CR) Muehlenhoff: [C: +2] zookeeper: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/792282 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:22:01] All good, I'm done!
[07:22:07] !log jnuche@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.12 refs T305218
[07:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:13] T305218: 1.39.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T305218
[07:22:26] (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35302/console" [puppet] - https://gerrit.wikimedia.org/r/792155 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:22:49] WMDE-Fisch: Thanks. I'll also self-deploy..
[07:23:00] kart_: Great!
[07:23:21] (CR) KartikMistry: [C: +2] Enable Section Translation in bcl, is, ne, pa, ts and ur Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/791481 (https://phabricator.wikimedia.org/T304828) (owner: KartikMistry)
[07:23:29] (PS4) KartikMistry: Enable Section Translation in bcl, is, ne, pa, ts and ur Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/791481 (https://phabricator.wikimedia.org/T304828)
[07:23:56] (CR) Slyngshede: [V: +1] "Switched patch from systemd timers to systemd tmpfile instead." [puppet] - https://gerrit.wikimedia.org/r/792155 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:25:14] good morning! Hi jnuche :)
[07:25:15] (CR) Slyngshede: Update statistics::rsync::published to use SystemD timers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/789570 (https://phabricator.wikimedia.org/T123456) (owner: Slyngshede)
[07:25:43] morning hashar! 👋
[07:25:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:25:47] I guess I missed that we run the train at 9am CEST
[07:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:26:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:49] just the prep as usual I thought, the first deploy to group0 will start after 10 CEST as usual
[07:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:30] for the risky patches, the new extension `SimilarEditors` should not cause any issue. It is merely shipping files that are not in use anywhere so people can "easily" turn on the extension whenever they are around
[07:27:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:27:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:09] the other with database lazy connections, well I don't know. It sounds like every time we touch that area of code there is some surprising side effect spurting out :]
[07:28:30] anyway they seem fine :]
[07:29:22] 'scap pull' on mwdebug1001 is taking longer than usual..
[07:29:31] OK. now finished.
[07:30:07] kart_: the longer scap pull, I think, is because the new mediawiki 1.39.0-wmf.12 is on the deploy server
[07:30:20] so it takes 2/3 minutes to rsync all of mediawiki code + l10n cache
[07:31:13] (CR) Muehlenhoff: [C: +1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/792253 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:32:55] hashar: What's this? `07:32:36 sync-file failed: Failed to acquire lock "/var/lock/scap.operations_mediawiki-config.lock"; owner is "jnuche"; reason is "testwikis wikis to 1.39.0-wmf.12 refs T305218"`
[07:32:55] T305218: 1.39.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T305218
[07:33:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:32] (PS1) Filippo Giunchedi: prometheus: fix node-exim-queue' stats on no matches [puppet] - https://gerrit.wikimedia.org/r/792558 (https://phabricator.wikimedia.org/T305847)
[07:33:49] jnuche ^^ The config deployment window is on.
[07:34:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:34:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:14] kart_: sorry, that must be the train staging that I'm running right now, I didn't know it could affect the other deployment window
[07:34:37] Yes. We have a window for that reason :)
[07:34:57] I'll just cancel
[07:35:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:35:00] !log jnuche@deploy1002 deploy-promote aborted: (duration: 14m 44s)
[07:35:01] !log jnuche@deploy1002 stage-train aborted: (duration: 25m 33s)
[07:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:18] kart_: done, you should be able to continue now
[07:35:20] sorry about that
[07:35:32] I'll wait until you're done
[07:35:34] (CR) Filippo Giunchedi: [C: +2] prometheus: fix node-exim-queue' stats on no matches [puppet] - https://gerrit.wikimedia.org/r/792558 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[07:35:40] (PS2) Filippo Giunchedi: prometheus: fix node-exim-queue' stats on no matches [puppet] - https://gerrit.wikimedia.org/r/792558 (https://phabricator.wikimedia.org/T305847)
[07:35:41] jnuche: Thanks and no problem! I'll take a few minutes only..
[07:35:56] SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (MoritzMuehlenhoff)
[07:36:39] !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:791481|Enable Section Translation in bcl, is, ne, pa, ts and ur Wikipedias (T304828)]] (duration: 00m 53s)
[07:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:44] T304828: Enable Section Translation in 13 wikis where Content Translation is already available as default - https://phabricator.wikimedia.org/T304828
[07:36:56] !log UTC morning backport window - Done.
[07:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:02] jnuche: You can go ahead.
[07:37:34] (CR) Slyngshede: [C: +2] Update statistics::publishd to use SystemD timers, rather than cron. [puppet] - https://gerrit.wikimedia.org/r/789599 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:37:36] kart_: thanks!
[07:38:34] (CR) Ayounsi: [C: +2] commons: use _get_junos_interfaces [homer/public] - https://gerrit.wikimedia.org/r/792210 (owner: Ayounsi)
[07:39:13] (Merged) jenkins-bot: commons: use _get_junos_interfaces [homer/public] - https://gerrit.wikimedia.org/r/792210 (owner: Ayounsi)
[07:39:22] !log jnuche@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.12 refs T305218
[07:39:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:27] T305218: 1.39.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T305218
[07:41:47] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/792178 (https://phabricator.wikimedia.org/T308013) (owner: Ladsgroup)
[07:43:48] SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (MoritzMuehlenhoff) >>! In T308013#7931137, @jcrespo wrote: >> Apache 2 seems to be used by puppet and the puppet modules, it retains the copyright so that seems fine to me...
[07:45:00] SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (MoritzMuehlenhoff)
[07:48:01] (PS1) KartikMistry: Enable Section Translation in as, gu, kn, mk and, mr Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/792559 (https://phabricator.wikimedia.org/T304828)
[07:49:40] (PS1) Slyngshede: Redirection not available for systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792560
[07:51:56] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:53:11] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/792560 (owner: Slyngshede)
[07:53:28] (CR) Slyngshede: [C: +2] Redirection not available for systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792560 (owner: Slyngshede)
[07:53:58] !log jnuche@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.12 refs T305218 (duration: 14m 35s)
[07:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:03] T305218: 1.39.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T305218
[07:54:36] (CR) Muehlenhoff: [C: +1] "Looks good. One other option: Since prometheus-labs-targets.py is already shipped by us via Puppet we could also simply add a new option t" [puppet] - https://gerrit.wikimedia.org/r/792185 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:56:11] (CR) DCausse: [C: +1] rdf query service: Apply WARN log level only to com.bigdata [puppet] - https://gerrit.wikimedia.org/r/792266 (https://phabricator.wikimedia.org/T306899) (owner: Ebernhardson)
[07:56:15] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/792177 (https://phabricator.wikimedia.org/T308013) (owner: Ladsgroup)
[07:56:54] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/792175 (https://phabricator.wikimedia.org/T308013) (owner: Ladsgroup)
[07:57:43] (PS2) Ladsgroup: orchestrator: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/792178 (https://phabricator.wikimedia.org/T308013)
[07:57:45] (PS1) Ayounsi: evpn: use get_junos_interfaces [homer/public] - https://gerrit.wikimedia.org/r/792561
[07:58:17] (CR) Muehlenhoff: [C: +2] Add SPDX headers to debdeploy/adduser/puppetboard modules [puppet] - https://gerrit.wikimedia.org/r/791596 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[07:58:30] (CR) Ladsgroup: [C: +2] orchestrator: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/792178 (https://phabricator.wikimedia.org/T308013) (owner: Ladsgroup)
[07:59:18] (PS2) Ladsgroup: dbtree: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/792175 (https://phabricator.wikimedia.org/T308013)
[08:00:05] jnuche and hashar: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220517T0800).
[08:00:11] (CR) Ladsgroup: [C: +2] dbtree: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/792175 (https://phabricator.wikimedia.org/T308013) (owner: Ladsgroup)
[08:00:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:28] (PS2) Ladsgroup: proxysql: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/792177 (https://phabricator.wikimedia.org/T308013)
[08:00:31] (CR) Ayounsi: [C: +2] evpn: use get_junos_interfaces [homer/public] - https://gerrit.wikimedia.org/r/792561 (owner: Ayounsi)
[08:01:14] (Merged) jenkins-bot: evpn: use get_junos_interfaces [homer/public] - https://gerrit.wikimedia.org/r/792561 (owner: Ayounsi)
[08:01:14] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:01:16] (CR) Ladsgroup: [C: +2] proxysql: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/792177 (https://phabricator.wikimedia.org/T308013) (owner: Ladsgroup)
[08:03:38] (PS1) Jaime Nuche: group0 wikis to 1.39.0-wmf.12 refs T305218 [mediawiki-config] - https://gerrit.wikimedia.org/r/792563
[08:03:40] (CR) Jaime Nuche: [C: +2] group0 wikis to 1.39.0-wmf.12 refs T305218 [mediawiki-config] - https://gerrit.wikimedia.org/r/792563 (owner: Jaime Nuche)
[08:04:21] (PS2) Filippo Giunchedi: sre: port mediawiki php-fpm saturation alert [alerts] - https://gerrit.wikimedia.org/r/791356 (https://phabricator.wikimedia.org/T305847)
[08:04:23] (PS1) Filippo Giunchedi: sre: port mx queue high page [alerts] - https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847)
[08:04:25] (Merged) jenkins-bot: group0 wikis to 1.39.0-wmf.12 refs T305218 [mediawiki-config] - https://gerrit.wikimedia.org/r/792563 (owner: Jaime Nuche)
[08:05:34] (CR) Filippo Giunchedi: "Note that we're not ready yet to merge this (not enough data in the metric IMHO), however I wanted to put it out there for your considerat" [alerts] - https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[08:05:58] (CR) Jcrespo: "Shouldn't dbtree be removed from puppet instead?" [puppet] - https://gerrit.wikimedia.org/r/792175 (https://phabricator.wikimedia.org/T308013) (owner: Ladsgroup)
[08:06:02] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.12 refs T305218
[08:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:08] T305218: 1.39.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T305218
[08:06:24] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/789570 (https://phabricator.wikimedia.org/T123456) (owner: Slyngshede)
[08:07:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:07:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:07:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:25] (CR) Filippo Giunchedi: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/792155 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:08:36] !log installing ffmpeg security updates on stretch
[08:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:51] (PS1) Ladsgroup: Turn on read new for templatelinks on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/792565 (https://phabricator.wikimedia.org/T306673)
[08:13:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:45] (CR) Marostegui: [C: +2] admin: add Antoine Musso to Phabricator hosts [puppet] - https://gerrit.wikimedia.org/r/792270 (https://phabricator.wikimedia.org/T308478) (owner: Hashar)
[08:14:07] (CR) Ladsgroup: [C: +2] dbtree: Add SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/792175 (https://phabricator.wikimedia.org/T308013) (owner: Ladsgroup)
[08:15:04] SRE, SRE-Access-Requests, Phabricator, Release-Engineering-Team, Patch-For-Review: Add Antoine Musso to Phabricator hosts - https://phabricator.wikimedia.org/T308478 (Marostegui) Open→Resolved a:Marostegui merged the change and ran puppet on phab1001.eqiad.wmnet
[08:15:48] SRE, ops-eqiad, DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (Marostegui) p:Triage→Medium
[08:16:03] SRE, DBA, Wikimedia-Incident, Wikimedia-production-error: 2022-05-14 Databases - https://phabricator.wikimedia.org/T308380 (Marostegui) p:Triage→Medium
[08:16:15] (CR) Volans: "addressed comments" [software/spicerack] - https://gerrit.wikimedia.org/r/775904 (owner: Volans)
[08:16:39] (PS4) Volans: service: add new module to expose service::catalog [software/spicerack] - https://gerrit.wikimedia.org/r/775904
[08:17:07] SRE, ops-eqiad, DC-Ops: Replace RAID
controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (10wiki_willy) a:03Jclark-ctr [08:17:53] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (10Marostegui) p:05Triage→03Medium [08:17:57] (03CR) 10Slyngshede: [C: 03+2] Move automated target generation of Prometheus targets to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792185 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [08:18:07] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10Marostegui) p:05Triage→03Medium [08:18:17] 10SRE, 10Infrastructure-Foundations: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 (10Marostegui) p:05Triage→03Medium [08:18:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:34] 10SRE, 10SRE-swift-storage, 10User-fgiunchedi: swift-account-stats failures on thanos-swift - https://phabricator.wikimedia.org/T307907 (10Marostegui) p:05Triage→03Medium [08:18:51] 10SRE, 10Traffic, 10Patch-For-Review: Implement SLI measurement for HAProxy - https://phabricator.wikimedia.org/T307898 (10Marostegui) p:05Triage→03Medium [08:19:11] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops, 10Patch-For-Review: contint/releases/hosts with helm installed: puppet - Could not find group deployment - https://phabricator.wikimedia.org/T307740 (10Marostegui) p:05Triage→03Medium [08:19:17] 10SRE, 10ops-eqiad, 10DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (10wiki_willy) Hi @Jclark-ctr - this one is out of warranty, but let me know if you have any spares around or if we should purchase one. 
Thanks, Willy [08:19:29] 10SRE, 10RESTBase-API, 10Traffic, 10Documentation: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (10Marostegui) p:05Triage→03Medium [08:20:15] 10SRE, 10Patch-For-Review, 10Wikimedia-Incident: Modernize etcd tlsproxy certificate management - https://phabricator.wikimedia.org/T307382 (10Marostegui) p:05Triage→03Medium [08:20:29] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (10Marostegui) p:05Triage→03Medium [08:20:43] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic: Betacommons: 504, Connection Timed Out at 2022-05-02 13:35:16 GMT - https://phabricator.wikimedia.org/T307354 (10Marostegui) p:05Triage→03Medium [08:21:04] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@b569ee8]: Update DAG spark conf [airflow-dags/analytics@b569ee8] [08:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:11] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@b569ee8]: Update DAG spark conf [airflow-dags/analytics@b569ee8] (duration: 00m 07s) [08:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:58] (03PS1) 10Ayounsi: switches: use get_junos_interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/792566 [08:24:21] (03CR) 10Slyngshede: [C: 03+2] Update statistics::rsync::published to use SystemD timers [puppet] - 10https://gerrit.wikimedia.org/r/789570 (https://phabricator.wikimedia.org/T123456) (owner: 10Slyngshede) [08:25:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:25:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:18] (03CR) 10Marostegui: auto_schema: Make alter non-blocking on master of primary dc (031 comment) [software] - 10https://gerrit.wikimedia.org/r/791297 (owner: 10Ladsgroup) [08:28:11] (03PS1) 10Cathal Mooney: Add policer config to swithes [homer/public] - 10https://gerrit.wikimedia.org/r/792567 [08:28:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:41] (03PS2) 10Muehlenhoff: Remove webperf1001/2001 from Scap config [puppet] - 10https://gerrit.wikimedia.org/r/791300 [08:30:05] jouncebot: nowandnext [08:30:05] For the next 1 hour(s) and 29 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220517T0800) [08:30:05] In 4 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220517T1300) [08:32:25] (03PS1) 10Ladsgroup: ContribsPager: Update index hint to use revision table in READ NEW [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/792474 (https://phabricator.wikimedia.org/T307295) [08:32:39] (03PS1) 10Ladsgroup: ContribsPager: Update index hint to use revision table in READ NEW [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792475 (https://phabricator.wikimedia.org/T307295) [08:32:47] (03CR) 10Ladsgroup: [C: 03+2] ContribsPager: Update index hint to use revision table in READ NEW [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792475 (https://phabricator.wikimedia.org/T307295) (owner: 10Ladsgroup) [08:33:15] (03CR) 10Muehlenhoff: [C: 03+2] Remove webperf1001/2001 from Scap config [puppet] - 10https://gerrit.wikimedia.org/r/791300 (owner: 10Muehlenhoff) [08:33:37] jnuche: hi, I'm going to backport some stuff, are you done with the train? 
[08:34:32] Amir1: hi, yeah, you can go aead [08:34:36] *ahead [08:34:48] Thanks! [08:34:57] (03CR) 10Ladsgroup: [C: 03+2] ContribsPager: Update index hint to use revision table in READ NEW [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/792474 (https://phabricator.wikimedia.org/T307295) (owner: 10Ladsgroup) [08:35:51] (03CR) 10Ladsgroup: [C: 03+2] Turn on read new for templatelinks on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792565 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [08:37:25] (03Merged) 10jenkins-bot: Turn on read new for templatelinks on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792565 (https://phabricator.wikimedia.org/T306673) (owner: 10Ladsgroup) [08:38:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:31] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:792565|Turn on read new for templatelinks on frwiki (T306673)]] (duration: 02m 25s) [08:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:36] T306673: Turn on read new for templatelinks on beta and production - https://phabricator.wikimedia.org/T306673 [08:43:54] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:45:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [08:45:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [08:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 5%: After depooling', diff saved to 
https://phabricator.wikimedia.org/P27833 and previous config saved to /var/cache/conftool/dbconfig/20220517-084704-root.json [08:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:51] 10SRE, 10DBA, 10Wikimedia-Incident, 10Wikimedia-production-error: 2022-05-14 Databases - https://phabricator.wikimedia.org/T308380 (10Marostegui) I have tweaked db1172's weight and I am slowly repooling it [08:48:22] RECOVERY - Host an-tool1007 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [08:48:28] !log jmm@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti4002.ulsfo.wmnet with OS bullseye [08:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:36] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1001 for host ganeti4002.ulsfo.wmnet with OS bullseye [08:49:40] PROBLEM - turnilo.wikimedia.org requires authentication on an-tool1007 is CRITICAL: connect to address 10.64.36.118 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [08:50:00] PROBLEM - turnilo.wikimedia.org tls expiry on an-tool1007 is CRITICAL: connect to address 10.64.36.118 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [08:50:00] (03CR) 10Slyngshede: [C: 03+2] Move Wiki Rsync fetch jobs to systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/790967 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [08:50:06] ACKNOWLEDGEMENT - turnilo.wikimedia.org requires authentication on an-tool1007 is CRITICAL: connect to address 10.64.36.118 and port 443: Connection refused Btullis Working on the upgrade in T301990 https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [08:50:06] ACKNOWLEDGEMENT - turnilo.wikimedia.org tls expiry on an-tool1007 is CRITICAL: connect to address 10.64.36.118 and port 443: Connection refused Btullis Working on the upgrade in T301990 https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [08:51:16] (03PS2) 10Ladsgroup: mediabackup: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792176 (https://phabricator.wikimedia.org/T308013) [08:51:23] (03PS3) 10Ladsgroup: mediabackup: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792176 (https://phabricator.wikimedia.org/T308013) [08:52:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [08:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:30] (03Merged) 10jenkins-bot: ContribsPager: Update index hint to use revision table in READ NEW [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792475 (https://phabricator.wikimedia.org/T307295) (owner: 10Ladsgroup) [08:52:36] (03Merged) 10jenkins-bot: ContribsPager: Update index hint to use revision table in READ NEW [core] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/792474 (https://phabricator.wikimedia.org/T307295) (owner: 10Ladsgroup) [08:54:31] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.12/includes/specials/pagers/ContribsPager.php: Backport: [[gerrit:792475|ContribsPager: Update index hint to use revision table in READ NEW (T307295)]] (duration: 00m 56s) [08:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:36] T307295: Bot contributions page in Catalan 
wikipedia not displayed - https://phabricator.wikimedia.org/T307295 [08:57:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [08:57:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:04] !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.10/includes/specials/pagers/ContribsPager.php: Backport: [[gerrit:792474|ContribsPager: Update index hint to use revision table in READ NEW (T307295)]] (duration: 00m 53s) [08:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:58] (03CR) 10Alexandros Kosiaris: [C: 03+1] Don't schedule calico kube-controllers on master nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/777364 (owner: 10JMeybohm) [09:01:40] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10akosiaris) >>! In T306649#7933266, @JMeybohm wrote: >>>! In T306649#7931058, @akosiaris wrote: >>> Regarding the "fake nodes": I... 
[09:02:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 10%: After depooling', diff saved to https://phabricator.wikimedia.org/P27834 and previous config saved to /var/cache/conftool/dbconfig/20220517-090208-root.json [09:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:04:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:54] PROBLEM - puppet last run on an-tool1007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.36.118: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:04:58] PROBLEM - Check systemd state on an-tool1007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.36.118: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:05:14] PROBLEM - Check the NTP synchronisation status of timesyncd on an-tool1007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.36.118: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP [09:05:22] PROBLEM - Check that envoy is running on an-tool1007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.36.118: Connection reset by peer https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [09:05:40] PROBLEM - DPKG on an-tool1007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.36.118: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:05:56] !log jmm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4002.ulsfo.wmnet with reason: host reimage [09:05:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:32] 
RECOVERY - Check systemd state on an-tool1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:07:00] RECOVERY - Check that envoy is running on an-tool1007 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [09:09:21] (03CR) 10Muehlenhoff: aptrepo: import gitlab package for bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792108 (https://phabricator.wikimedia.org/T307142) (owner: 10Jelto) [09:09:36] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4002.ulsfo.wmnet with reason: host reimage [09:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:17] RECOVERY - puppet last run on an-tool1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:10:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:51] RECOVERY - Check systemd state on ms-be1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:11:05] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/792125 (https://phabricator.wikimedia.org/T304497) (owner: 10Volans) [09:12:40] (03PS1) 10Filippo Giunchedi: mx: remove queue size alert, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/792568 (https://phabricator.wikimedia.org/T305847) [09:12:42] (03PS1) 10Filippo Giunchedi: fastnetmon: export notification count as metric [puppet] - 10https://gerrit.wikimedia.org/r/792569 (https://phabricator.wikimedia.org/T305847) [09:13:20] (03CR) 10jerkins-bot: [V: 04-1] mx: remove queue size alert, moved to Prometheus [puppet] - 
10https://gerrit.wikimedia.org/r/792568 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [09:13:39] (03PS1) 10Marostegui: Revert "db1164: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/792476 [09:13:48] (03CR) 10jerkins-bot: [V: 04-1] fastnetmon: export notification count as metric [puppet] - 10https://gerrit.wikimedia.org/r/792569 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [09:14:31] (03CR) 10Cathal Mooney: [C: 03+2] Add new cloudsw to rancid for config backup [puppet] - 10https://gerrit.wikimedia.org/r/791600 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [09:14:34] (03CR) 10Marostegui: [C: 03+2] Revert "db1164: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/792476 (owner: 10Marostegui) [09:14:49] (03CR) 10Volans: [C: 03+2] "PCC happy: https://puppet-compiler.wmflabs.org/pcc-worker1001/35304/cumin1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/792125 (https://phabricator.wikimedia.org/T304497) (owner: 10Volans) [09:15:05] (03PS3) 10Filippo Giunchedi: mediawiki: remove idle php-fpm workers alert, moved to prometheus/alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/791360 (https://phabricator.wikimedia.org/T305847) [09:15:07] (03PS2) 10Filippo Giunchedi: mx: remove queue size alert, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/792568 (https://phabricator.wikimedia.org/T305847) [09:15:09] (03PS2) 10Filippo Giunchedi: fastnetmon: export notification count as metric [puppet] - 10https://gerrit.wikimedia.org/r/792569 (https://phabricator.wikimedia.org/T305847) [09:15:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [09:15:28] (03CR) 10Muehlenhoff: "I had no idea the same cron was also copied over to the other role. 
We can properly address this by reducing code duplication: If we creat" [puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:11] (03CR) 10jerkins-bot: [V: 04-1] mx: remove queue size alert, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/792568 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [09:16:14] !log btullis@deploy1002 Started deploy [analytics/turnilo/deploy@bf60521]: (no justification provided) [09:16:17] !log btullis@deploy1002 Finished deploy [analytics/turnilo/deploy@bf60521]: (no justification provided) (duration: 00m 03s) [09:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:18] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [09:16:19] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [09:16:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 25%: After depooling', diff saved to https://phabricator.wikimedia.org/P27835 and previous config saved to /var/cache/conftool/dbconfig/20220517-091712-root.json [09:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:43] (03CR) 10Ayounsi: [C: 03+2] switches: use get_junos_interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/792566 (owner: 10Ayounsi) [09:19:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [09:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:59] RECOVERY - turnilo.wikimedia.org 
requires authentication on an-tool1007 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 546 bytes in 1.006 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [09:20:20] (03Merged) 10jenkins-bot: switches: use get_junos_interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/792566 (owner: 10Ayounsi) [09:20:54] !log all switches, split configuration per interfaces (use new get_junos_interfaces function) [09:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:05] RECOVERY - Disk space on ms-be1040 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1040&var-datasource=eqiad+prometheus/ops [09:22:35] RECOVERY - turnilo.wikimedia.org tls expiry on an-tool1007 is OK: OK - Certificate yarn.wikimedia.org will expire on Sat 01 May 2027 07:37:58 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [09:24:13] (03CR) 10Elukey: [C: 03+1] Don't schedule calico kube-controllers on master nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/777364 (owner: 10JMeybohm) [09:25:10] !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4002.ulsfo.wmnet with OS bullseye [09:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:16] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1001 for host ganeti4002.ulsfo.wmnet with OS bullseye completed: - ganeti4002 (**PASS**) - Downtimed on Icinga/Aler... [09:26:58] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] Move Carbon Cache log cleanup to systemd tmpfile. 
[puppet] - 10https://gerrit.wikimedia.org/r/792155 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [09:32:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: After depooling', diff saved to https://phabricator.wikimedia.org/P27836 and previous config saved to /var/cache/conftool/dbconfig/20220517-093216-root.json [09:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:37] (03PS1) 10Ayounsi: switch interfaces: sort vlans [homer/public] - 10https://gerrit.wikimedia.org/r/792571 [09:36:25] RECOVERY - Check the NTP synchronisation status of timesyncd on an-tool1007 is OK: OK: synced at Tue 2022-05-17 09:36:24 UTC. https://wikitech.wikimedia.org/wiki/NTP [09:36:46] (03CR) 10Ayounsi: [C: 03+2] switch interfaces: sort vlans [homer/public] - 10https://gerrit.wikimedia.org/r/792571 (owner: 10Ayounsi) [09:36:53] RECOVERY - DPKG on an-tool1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [09:37:20] (03Merged) 10jenkins-bot: switch interfaces: sort vlans [homer/public] - 10https://gerrit.wikimedia.org/r/792571 (owner: 10Ayounsi) [09:39:49] (03CR) 10Alexandros Kosiaris: [C: 03+1] Reduce the scope of Calico's global BGP Peers for ml-serve-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/792232 (https://phabricator.wikimedia.org/T308418) (owner: 10Elukey) [09:44:57] PROBLEM - Check systemd state on an-tool1007 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:45:05] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:47:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 75%: After depooling', diff saved to https://phabricator.wikimedia.org/P27837 and previous config saved to 
/var/cache/conftool/dbconfig/20220517-094719-root.json [09:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:55] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:54:00] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:55:32] (03CR) 10Jbond: [C: 04-1] "Thanks for the patch, very much appreciated, but i wonder if this is the file you had intended to patch. software/puppet-compiler is the " [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall) [10:00:25] 10SRE, 10Infrastructure-Foundations: puppetmaster1001 disk warning on / - https://phabricator.wikimedia.org/T304898 (10Marostegui) 05Open→03Resolved a:03MoritzMuehlenhoff @MoritzMuehlenhoff dropped a bunch of `/tmp/tmp.*` and the disk is back to 64%: ` root@puppetmaster1001:/var/log/apache2# df -hT / Fil... 
[10:02:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 100%: After depooling', diff saved to https://phabricator.wikimedia.org/P27838 and previous config saved to /var/cache/conftool/dbconfig/20220517-100223-root.json [10:02:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:45] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/792253 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [10:03:02] (03CR) 10Jbond: [C: 03+2] C:helm: make the group permissions on helm_cache configurable [puppet] - 10https://gerrit.wikimedia.org/r/791565 (https://phabricator.wikimedia.org/T305729) (owner: 10Jbond) [10:04:53] (03CR) 10Jbond: [C: 03+2] codesearch: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792284 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [10:05:00] (03CR) 10Jbond: [C: 03+2] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/792284 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [10:05:42] (03CR) 10Jbond: [C: 03+2] libraryupgrader: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792253 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [10:09:08] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/775904 (owner: 10Volans) [10:09:57] (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/792176 (https://phabricator.wikimedia.org/T308013) (owner: 10Ladsgroup) [10:15:54] (03PS1) 10Jbond: admin - gitlab-roots: add *contint_roots_members to gitlab-roots [puppet] - 10https://gerrit.wikimedia.org/r/792576 (https://phabricator.wikimedia.org/T308350) [10:16:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4002.ulsfo.wmnet [10:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:45] (03PS1) 10JMeybohm: Add a rake task to generate JSON schema for chart CRDs on the fly [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/792577 (https://phabricator.wikimedia.org/T306165) [10:20:13] (03PS3) 10Filippo Giunchedi: fastnetmon: export notification count as metric [puppet] - 10https://gerrit.wikimedia.org/r/792569 (https://phabricator.wikimedia.org/T305847) [10:21:05] (03PS2) 10JMeybohm: Add a rake task to generate JSON schema for chart CRDs on the fly [deployment-charts] - 10https://gerrit.wikimedia.org/r/792577 (https://phabricator.wikimedia.org/T306165) [10:22:53] PROBLEM - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:24:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4002.ulsfo.wmnet [10:24:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:57] (03PS3) 10JMeybohm: Add a rake task to generate JSON schema for chart CRDs on the fly [deployment-charts] - 10https://gerrit.wikimedia.org/r/792577 (https://phabricator.wikimedia.org/T306165) [10:29:19] (03CR) 10Cathal Mooney: Add new subnets for cloudsw expansion Eqiad to netops infrastructure (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/791585 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [10:31:03] (03CR) 10Jcrespo: [C: 03+1] mediabackup: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792176 (https://phabricator.wikimedia.org/T308013) (owner: 10Ladsgroup) [10:32:01] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35306/console" [puppet] - 10https://gerrit.wikimedia.org/r/792569 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [10:32:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4002.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet [10:32:25] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:27] (03PS4) 10Jbond: sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 [10:32:29] (03PS3) 10Hnowlan: Set production role and add config for restbase2027 [puppet] - 10https://gerrit.wikimedia.org/r/779846 [10:32:50] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4002.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet [10:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:11] (03CR) 10jerkins-bot: [V: 04-1] sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 (owner: 10Jbond) [10:36:19] PROBLEM - Disk space on ms-be1040 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdl1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1040&var-datasource=eqiad+prometheus/ops [10:41:04] (03PS3) 10Jbond: dhcp: DHCPConfOpt82 media_type parameter [software/spicerack] - 10https://gerrit.wikimedia.org/r/792238 [10:41:09] (03CR) 10Jbond: [C: 03+1] "updated thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/792238 (owner: 10Jbond) [10:41:59] (03PS1) 10Muehlenhoff: Add SPDX headers for routinator/diffscan/bgpalerter/gobgpd/homer [puppet] - 10https://gerrit.wikimedia.org/r/792579 (https://phabricator.wikimedia.org/T308013) [10:50:09] (03PS2) 10Giuseppe Lavagetto: New version [software/conftool] - 10https://gerrit.wikimedia.org/r/792556 [10:50:11] (03PS1) 10Giuseppe Lavagetto: requestctl: always add header for detection [software/conftool] - 10https://gerrit.wikimedia.org/r/792580 [10:50:13] (03PS1) 10Giuseppe Lavagetto: requestctl: do not ask for confirmation for emtpy changes [software/conftool] - 10https://gerrit.wikimedia.org/r/792581 [10:55:34] (03PS4) 10Ladsgroup: mediabackup: Add 
SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792176 (https://phabricator.wikimedia.org/T308013) [10:55:40] (03CR) 10Ladsgroup: [C: 03+2] mediabackup: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792176 (https://phabricator.wikimedia.org/T308013) (owner: 10Ladsgroup) [10:58:53] (03PS2) 10Slyngshede: Move restart of slapd, due to memory leaks, to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) [10:59:13] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:59:29] (03CR) 10jerkins-bot: [V: 04-1] Move restart of slapd, due to memory leaks, to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:00:58] (03PS3) 10Slyngshede: Move restart of slapd, due to memory leaks, to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) [11:01:35] (03CR) 10jerkins-bot: [V: 04-1] Move restart of slapd, due to memory leaks, to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:02:47] (03PS4) 10Slyngshede: Move restart of slapd, due to memory leaks, to systemd timers. 
[puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) [11:07:35] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35307/console" [puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:09:47] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35308/console" [puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:14:46] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/792579 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:16:12] (03CR) 10Slyngshede: Move restart of slapd, due to memory leaks, to systemd timers. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:22:46] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:44:57] (03CR) 10Slyngshede: [V: 03+1] Move rabbitmq to systemd timer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791367 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [11:49:17] (03PS3) 10Filippo Giunchedi: mx: remove queue size alert, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/792568 (https://phabricator.wikimedia.org/T305847) [11:49:19] (03PS4) 10Filippo Giunchedi: fastnetmon: export notification count as metric [puppet] - 10https://gerrit.wikimedia.org/r/792569 (https://phabricator.wikimedia.org/T305847) [11:51:43] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:53:51] !log failover Ganeti master in ulsfo to ganeti4001 T307997 [11:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:53:58] T307997: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 [11:55:07] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:57:57] PROBLEM - ganeti-wconfd running on ganeti4003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [11:58:42] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/792569 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:00:10] (03CR) 10Filippo Giunchedi: [C: 03+2] fastnetmon: export notification count as metric [puppet] - 10https://gerrit.wikimedia.org/r/792569 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [12:00:15] (03PS5) 10Filippo Giunchedi: fastnetmon: export notification count as metric [puppet] - 10https://gerrit.wikimedia.org/r/792569 (https://phabricator.wikimedia.org/T305847) [12:00:25] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:02:40] (03PS1) 10Jbond: redfish: add support to upload files via the request method [software/spicerack] - 10https://gerrit.wikimedia.org/r/792595 [12:04:16] !log draining ganeti4003 T307997 [12:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:21] T307997: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 [12:14:11] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10awight) [12:19:50] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [12:19:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [12:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:57] !log hnowlan@deploy1002 Started deploy [restbase/deploy@6e39559]: Add kcgwiki - T305281 [12:19:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:04] T305281: Post-creation work for kcgwiki - https://phabricator.wikimedia.org/T305281 [12:21:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [12:21:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [12:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T303603)', diff saved to https://phabricator.wikimedia.org/P27840 and previous config saved to /var/cache/conftool/dbconfig/20220517-122201-ladsgroup.json [12:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:14] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [12:25:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T303603)', diff saved to https://phabricator.wikimedia.org/P27841 and previous config saved to /var/cache/conftool/dbconfig/20220517-122517-ladsgroup.json [12:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:31] (03CR) 10Ladsgroup: [C: 03+2] 
fix_logging.log_timestamp_type_T298555.py: New schema change. [software/schema-changes] - 10https://gerrit.wikimedia.org/r/788291 (https://phabricator.wikimedia.org/T298555) (owner: 10Kormat) [12:26:01] (03CR) 10Ladsgroup: [C: 03+2] fix_revision.rev_timestamp_type_T298560.py: New schema change. [software/schema-changes] - 10https://gerrit.wikimedia.org/r/788290 (https://phabricator.wikimedia.org/T298560) (owner: 10Kormat) [12:27:34] (03Merged) 10jenkins-bot: fix_logging.log_timestamp_type_T298555.py: New schema change. [software/schema-changes] - 10https://gerrit.wikimedia.org/r/788291 (https://phabricator.wikimedia.org/T298555) (owner: 10Kormat) [12:27:38] (03Merged) 10jenkins-bot: fix_revision.rev_timestamp_type_T298560.py: New schema change. [software/schema-changes] - 10https://gerrit.wikimedia.org/r/788290 (https://phabricator.wikimedia.org/T298560) (owner: 10Kormat) [12:36:44] (03CR) 10Volans: [C: 03+1] "LGTM, feel free to merge as is. Couple of questions inline." [software/spicerack] - 10https://gerrit.wikimedia.org/r/792595 (owner: 10Jbond) [12:39:49] (03CR) 10Ottomata: [C: 03+1] Move Hadoop eventlogs cleanup to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/792116 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [12:39:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [12:39:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [12:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27842 and previous config saved to /var/cache/conftool/dbconfig/20220517-124022-ladsgroup.json [12:40:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:09] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The following units failed: swift-object.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:42:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [12:42:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [12:42:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:42:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298560)', diff saved to https://phabricator.wikimedia.org/P27843 and previous config saved to /var/cache/conftool/dbconfig/20220517-124227-ladsgroup.json [12:42:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:34] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [12:44:47] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:47:15] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/792238 (owner: 10Jbond) [12:55:04] (03CR) 10Kosta Harlan: "should we consider adding the messages on wiki via MediaWiki namespace, as syncing the i18n updates is time-consuming?" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792478 (https://phabricator.wikimedia.org/T305659) (owner: 10Gergő Tisza) [12:55:21] (03CR) 10CDanis: [C: 03+1] requestctl: fix small issues with the VSL translations [software/conftool] - 10https://gerrit.wikimedia.org/r/792555 (owner: 10Giuseppe Lavagetto) [12:55:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27844 and previous config saved to /var/cache/conftool/dbconfig/20220517-125527-ladsgroup.json [12:55:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:33] (03CR) 10Ayounsi: [C: 03+1] Add SPDX headers for routinator/diffscan/bgpalerter/gobgpd/homer [puppet] - 10https://gerrit.wikimedia.org/r/792579 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:55:37] (03CR) 10CDanis: [C: 03+1] requestctl: always add header for detection [software/conftool] - 
10https://gerrit.wikimedia.org/r/792580 (owner: 10Giuseppe Lavagetto) [12:56:09] (03CR) 10CDanis: [C: 03+1] requestctl: do not ask for confirmation for emtpy changes [software/conftool] - 10https://gerrit.wikimedia.org/r/792581 (owner: 10Giuseppe Lavagetto) [12:56:27] (03CR) 10CDanis: [C: 03+1] "please fix then lgtm" [software/conftool] - 10https://gerrit.wikimedia.org/r/792556 (owner: 10Giuseppe Lavagetto) [12:57:00] (03PS1) 10Zabe: wmfmariadbpy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792607 (https://phabricator.wikimedia.org/T308013) [12:57:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [12:57:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance [12:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T300774)', diff saved to https://phabricator.wikimedia.org/P27845 and previous config saved to /var/cache/conftool/dbconfig/20220517-125713-ladsgroup.json [12:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:18] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [12:57:19] (03CR) 10Ayounsi: [C: 03+1] Add policer config to swithes [homer/public] - 10https://gerrit.wikimedia.org/r/792567 (owner: 10Cathal Mooney) [12:58:11] (03CR) 10Elukey: [C: 03+2] Reduce the scope of Calico's global BGP Peers for ml-serve-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/792232 (https://phabricator.wikimedia.org/T308418) (owner: 10Elukey) [12:58:34] (03CR) 10Jcrespo: [C: 03+1] "I did a manual run of backups for both clients. 
They got the following amount of data:" [puppet] - 10https://gerrit.wikimedia.org/r/792125 (https://phabricator.wikimedia.org/T304497) (owner: 10Volans) [12:58:47] (03CR) 10Alexandros Kosiaris: [C: 03+2] mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/787831 (owner: 10PipelineBot) [12:59:12] (03PS1) 10Zabe: wikilabels: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792608 (https://phabricator.wikimedia.org/T308013) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220517T1300). [13:00:04] tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:01:10] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:27] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[13:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:45] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:02:15] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:02:41] I'll deploy [13:02:51] !log killed cawiki's refreshLinkRecommendations.php (T299021) [13:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:57] T299021: Shorten running time of refreshLinkRecommendations.php - https://phabricator.wikimedia.org/T299021 [13:03:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T300774)', diff saved to https://phabricator.wikimedia.org/P27846 and previous config saved to /var/cache/conftool/dbconfig/20220517-130322-ladsgroup.json [13:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:28] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [13:03:53] (03Merged) 10jenkins-bot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/787831 (owner: 10PipelineBot) [13:04:08] (03PS1) 10Zabe: visualdiff: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792609 (https://phabricator.wikimedia.org/T308013) [13:05:37] (03CR) 10Gergő Tisza: Account creation: add Thank you banner texts (031 comment) [extensions/GrowthExperiments] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792478 (https://phabricator.wikimedia.org/T305659) (owner: 10Gergő Tisza) 
[13:05:42] (03CR) 10Gergő Tisza: [C: 03+2] Account creation: add Thank you banner texts [extensions/GrowthExperiments] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792478 (https://phabricator.wikimedia.org/T305659) (owner: 10Gergő Tisza) [13:09:33] (03CR) 10Volans: [C: 03+2] cluster::management: backup auditing logs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792125 (https://phabricator.wikimedia.org/T304497) (owner: 10Volans) [13:10:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T303603)', diff saved to https://phabricator.wikimedia.org/P27848 and previous config saved to /var/cache/conftool/dbconfig/20220517-131032-ladsgroup.json [13:10:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [13:10:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [13:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:38] (03PS1) 10Elukey: Allow BGP from calico pods running on master nodes on ml-serve-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/792611 (https://phabricator.wikimedia.org/T308418) [13:10:39] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [13:10:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T303603)', diff saved to https://phabricator.wikimedia.org/P27849 and previous config saved to /var/cache/conftool/dbconfig/20220517-131040-ladsgroup.json [13:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:58] '12 [13:12:01] uff [13:14:19] (03PS2) 10Jbond: redfish: update signature of requests method with kwargs 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/792595 [13:14:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T303603)', diff saved to https://phabricator.wikimedia.org/P27850 and previous config saved to /var/cache/conftool/dbconfig/20220517-131453-ladsgroup.json [13:14:58] (03CR) 10Elukey: "The alternative could be to just remove BGP session configuration from the homer public repository, but it may be confusing if we'll want " [deployment-charts] - 10https://gerrit.wikimedia.org/r/792611 (https://phabricator.wikimedia.org/T308418) (owner: 10Elukey) [13:14:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:10] (03CR) 10Jbond: "thanks updated to pass kwargs as per irc discussion" [software/spicerack] - 10https://gerrit.wikimedia.org/r/792595 (owner: 10Jbond) [13:16:10] (03CR) 10BCornwall: cli: Add support for XDG Base Directory spec (033 comments) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall) [13:18:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P27851 and previous config saved to /var/cache/conftool/dbconfig/20220517-131827-ladsgroup.json [13:18:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:05] (03PS1) 10Andrew Bogott: icinga: remove creds for a couple of departed WMCS SREs [puppet] - 10https://gerrit.wikimedia.org/r/792612 [13:21:07] (03PS1) 10Andrew Bogott: icinga: Added Camel Case version of my name as authorized user [puppet] - 10https://gerrit.wikimedia.org/r/792613 (https://phabricator.wikimedia.org/T275920) [13:23:09] 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, and 2 others: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (10hnowlan) [13:24:14] 10SRE-OnFire, 10SRE Observability (FY2021/2022-Q4): implementing an incident response workflow 
automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10lmata) [13:26:19] (03CR) 10Volans: "In light of IRC chats and the previous comments, I did a full pass and make some suggestions on how to align this more with the SREBatchBa" [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm) [13:29:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P27852 and previous config saved to /var/cache/conftool/dbconfig/20220517-132958-ladsgroup.json [13:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:55] (03CR) 10Hnowlan: New service: image-suggestion (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [13:33:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P27853 and previous config saved to /var/cache/conftool/dbconfig/20220517-133333-ladsgroup.json [13:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:55] somehow stuck in "ready to submit". Guess I'll have to force merge. [13:35:18] (03PS4) 10Jbond: dhcp: DHCPConfOpt82 and DHCPConfMac media_type parameter [software/spicerack] - 10https://gerrit.wikimedia.org/r/792238 [13:35:59] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10Papaul) @Dzahn are you planning on re-imaging the server after the move so I know what approach to take for the IP change? [13:36:08] the gate-and-submit pipeline was still running, it was ready to submit due to the V+2 from the main test build [13:39:13] I see. It's an i18n-only patch so the tests wouldn't have much use anyway. 
[13:39:23] (03CR) 10Volans: "Nit inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/792595 (owner: 10Jbond) [13:39:35] (03CR) 10JMeybohm: [C: 03+1] Allow BGP from calico pods running on master nodes on ml-serve-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/792611 (https://phabricator.wikimedia.org/T308418) (owner: 10Elukey) [13:39:57] (03CR) 10JMeybohm: [C: 03+2] Don't schedule calico kube-controllers on master nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/777364 (owner: 10JMeybohm) [13:40:13] !log tgr@deploy1002 Started scap: Backport with i18n changes: [[gerrit:792478|Account creation: add Thank you banner texts]] [13:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:35] (03CR) 10Jbond: [C: 04-1] cli: Add support for XDG Base Directory spec (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall) [13:43:18] (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: remove creds for a couple of departed WMCS SREs [puppet] - 10https://gerrit.wikimedia.org/r/792612 (owner: 10Andrew Bogott) [13:43:40] 10SRE, 10Icinga, 10Observability-Alerting, 10observability, 10Patch-For-Review: icinga login case mismatch - https://phabricator.wikimedia.org/T275920 (10Andrew) One proposal (which may or may not be possible) would be to standardize on all-lowercase logins in icinga config, and then have our login front... 
[13:44:26] (03Merged) 10jenkins-bot: Don't schedule calico kube-controllers on master nodes [deployment-charts] - 10https://gerrit.wikimedia.org/r/777364 (owner: 10JMeybohm) [13:45:00] (03CR) 10Filippo Giunchedi: [C: 03+1] icinga: Added Camel Case version of my name as authorized user [puppet] - 10https://gerrit.wikimedia.org/r/792613 (https://phabricator.wikimedia.org/T275920) (owner: 10Andrew Bogott) [13:45:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P27854 and previous config saved to /var/cache/conftool/dbconfig/20220517-134503-ladsgroup.json [13:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:59] (03PS3) 10Jbond: redfish: update signature of requests method with kwargs [software/spicerack] - 10https://gerrit.wikimedia.org/r/792595 [13:46:01] (03CR) 10Jbond: "updated thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/792595 (owner: 10Jbond) [13:46:21] (03CR) 10Jbond: [C: 03+2] dhcp: DHCPConfOpt82 and DHCPConfMac media_type parameter [software/spicerack] - 10https://gerrit.wikimedia.org/r/792238 (owner: 10Jbond) [13:46:33] (03CR) 10Elukey: [C: 03+2] Allow BGP from calico pods running on master nodes on ml-serve-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/792611 (https://phabricator.wikimedia.org/T308418) (owner: 10Elukey) [13:46:56] 10SRE, 10Icinga, 10Observability-Alerting, 10observability, 10Patch-For-Review: icinga login case mismatch - https://phabricator.wikimedia.org/T275920 (10fgiunchedi) I'm ok to stick with capitalized names since that's the convention and AFAICT the default / expected format. 
[13:47:25] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/792595 (owner: 10Jbond) [13:48:05] (03CR) 10Andrew Bogott: [C: 03+2] icinga: remove creds for a couple of departed WMCS SREs [puppet] - 10https://gerrit.wikimedia.org/r/792612 (owner: 10Andrew Bogott) [13:48:14] (03CR) 10Andrew Bogott: [C: 03+2] icinga: Added Camel Case version of my name as authorized user [puppet] - 10https://gerrit.wikimedia.org/r/792613 (https://phabricator.wikimedia.org/T275920) (owner: 10Andrew Bogott) [13:48:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T300774)', diff saved to https://phabricator.wikimedia.org/P27855 and previous config saved to /var/cache/conftool/dbconfig/20220517-134838-ladsgroup.json [13:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:43] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [13:49:59] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[13:50:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [13:50:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [13:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:04] (03Abandoned) 10Thiemo Kreuz (WMDE): Duplicate "latest revision may be special" logic from FlaggedRevs [extensions/Kartographer] (wmf/1.39.0-wmf.10) - 10https://gerrit.wikimedia.org/r/791248 (https://phabricator.wikimedia.org/T304813) (owner: 10Awight) [13:50:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T300774)', diff saved to https://phabricator.wikimedia.org/P27856 and previous config saved to /var/cache/conftool/dbconfig/20220517-135006-ladsgroup.json [13:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:55] (03CR) 10Jbond: [C: 03+2] admin - gitlab-roots: add *contint_roots_members to gitlab-roots [puppet] - 10https://gerrit.wikimedia.org/r/792576 (https://phabricator.wikimedia.org/T308350) (owner: 10Jbond) [13:52:18] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[13:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:20] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: fix small issues with the VSL translations [software/conftool] - 10https://gerrit.wikimedia.org/r/792555 (owner: 10Giuseppe Lavagetto) [13:54:00] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:54:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T300774)', diff saved to https://phabricator.wikimedia.org/P27857 and previous config saved to /var/cache/conftool/dbconfig/20220517-135401-ladsgroup.json [13:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:07] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [13:54:45] (03Merged) 10jenkins-bot: dhcp: DHCPConfOpt82 and DHCPConfMac media_type parameter [software/spicerack] - 10https://gerrit.wikimedia.org/r/792238 (owner: 10Jbond) [13:55:01] (03PS1) 10Majavah: openstack: Make enc api enforce keystone policy [puppet] - 10https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666) [13:55:11] !log tgr@deploy1002 Finished scap: Backport with i18n changes: [[gerrit:792478|Account creation: add Thank you banner texts]] (duration: 14m 57s) [13:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:53] (03Merged) 10jenkins-bot: requestctl: fix small issues with the VSL translations [software/conftool] - 10https://gerrit.wikimedia.org/r/792555 (owner: 10Giuseppe Lavagetto) [13:55:59] (03CR) 10jerkins-bot: [V: 04-1] openstack: Make enc api enforce keystone policy [puppet] - 10https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [13:56:51] 10SRE, 
10ops-ulsfo, 10DC-Ops, 10Traffic: ganeti4002 dimm error - https://phabricator.wikimedia.org/T303318 (10RobH) a:05RobH→03MoritzMuehlenhoff @MoritzMuehlenhoff, Can we plan to have ganeti4002 drained of activity for me on Thursday, May 19th, so I can swap out the defective memory stick? [13:56:57] (03PS2) 10Majavah: openstack: Make enc api enforce keystone policy [puppet] - 10https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666) [13:58:12] (03CR) 10jerkins-bot: [V: 04-1] openstack: Make enc api enforce keystone policy [puppet] - 10https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [13:58:38] (03PS3) 10Majavah: openstack: Make enc api enforce keystone policy [puppet] - 10https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666) [13:59:23] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10elukey) I merged two changes for the ml-serve-eqiad cluster, and now the concerns expressed in T306649#7881940 should be gone:... 
[13:59:53] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: always add header for detection [software/conftool] - 10https://gerrit.wikimedia.org/r/792580 (owner: 10Giuseppe Lavagetto) [14:00:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T303603)', diff saved to https://phabricator.wikimedia.org/P27858 and previous config saved to /var/cache/conftool/dbconfig/20220517-140008-ladsgroup.json [14:00:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [14:00:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [14:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:15] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [14:00:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T303603)', diff saved to https://phabricator.wikimedia.org/P27859 and previous config saved to /var/cache/conftool/dbconfig/20220517-140016-ladsgroup.json [14:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:42] (03Merged) 10jenkins-bot: requestctl: always add header for detection [software/conftool] - 10https://gerrit.wikimedia.org/r/792580 (owner: 10Giuseppe Lavagetto) [14:04:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T303603)', diff saved to https://phabricator.wikimedia.org/P27860 and previous config saved to /var/cache/conftool/dbconfig/20220517-140431-ladsgroup.json [14:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:36] !log akosiaris@deploy1002 helmfile [staging] START 
helmfile.d/services/mathoid: apply [14:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:10] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [14:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:06] (03CR) 10Majavah: "tested on codfw1dev" [puppet] - 10https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [14:07:10] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply [14:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [14:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:04] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [14:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:12] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [14:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:29] (03PS4) 10Majavah: openstack: Make enc api enforce keystone policy [puppet] - 10https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666) [14:09:06] (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: do not ask for confirmation for emtpy changes [software/conftool] - 10https://gerrit.wikimedia.org/r/792581 (owner: 10Giuseppe Lavagetto) [14:09:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27861 and previous config saved to /var/cache/conftool/dbconfig/20220517-140906-ladsgroup.json [14:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:24] (03PS1) 10Ayounsi: wmf-netbox: remove deprecated functions [software/homer/deploy] - 
10https://gerrit.wikimedia.org/r/792621 [14:10:15] (03CR) 10Volans: "It does include also a change in the description, is that wanted?" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792621 (owner: 10Ayounsi) [14:11:15] (03Merged) 10jenkins-bot: requestctl: do not ask for confirmation for emtpy changes [software/conftool] - 10https://gerrit.wikimedia.org/r/792581 (owner: 10Giuseppe Lavagetto) [14:11:45] 10SRE, 10SRE-Access-Requests, 10Machine-Learning-Team (Active Tasks): Add Aiko and Kevin to the deployment posix group - https://phabricator.wikimedia.org/T308308 (10elukey) [14:12:30] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [14:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:35] (03PS2) 10Ayounsi: wmf-netbox: remove deprecated functions [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792621 [14:14:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] New version [software/conftool] - 10https://gerrit.wikimedia.org/r/792556 (owner: 10Giuseppe Lavagetto) [14:14:44] (03PS3) 10Ayounsi: wmf-netbox: remove deprecated functions [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792621 [14:14:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [14:14:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [14:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:56] (03PS3) 10Giuseppe Lavagetto: New version [software/conftool] - 10https://gerrit.wikimedia.org/r/792556 [14:16:01] (03Abandoned) 10Ayounsi: wmf-netbox: remove deprecated functions [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792621 (owner: 10Ayounsi) [14:17:25] (03CR) 10Giuseppe Lavagetto: New version (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/792556 (owner: 10Giuseppe 
Lavagetto) [14:18:18] (03PS1) 10Ayounsi: wmf-netbox: remove deprecated functions [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792622 [14:19:28] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] New version (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/792556 (owner: 10Giuseppe Lavagetto) [14:19:31] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@6e39559]: Add kcgwiki - T305281 (duration: 119m 34s) [14:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P27862 and previous config saved to /var/cache/conftool/dbconfig/20220517-141936-ladsgroup.json [14:19:36] T305281: Post-creation work for kcgwiki - https://phabricator.wikimedia.org/T305281 [14:19:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:59] (03CR) 10Andrew Bogott: [C: 03+1] "This seems straightforward and good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [14:20:40] (03CR) 10Volans: [C: 03+1] "LGTM if the templates have been all updated" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/792622 (owner: 10Ayounsi) [14:21:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [14:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27863 and previous config saved to /var/cache/conftool/dbconfig/20220517-142411-ladsgroup.json [14:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:00] (03PS3) 10Hashar: Json schema from Gerrit Java event classes [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791642 (https://phabricator.wikimedia.org/T304947) [14:25:29] (03CR) 10jerkins-bot: [V: 04-1] Json schema from Gerrit Java event classes [software/gerrit/jsonschemagenerator] - 10https://gerrit.wikimedia.org/r/791642 (https://phabricator.wikimedia.org/T304947) (owner: 10Hashar) [14:25:37] (03PS1) 10Cathal Mooney: VRF element additions for cloudsw extention to row E/F [homer/public] - 10https://gerrit.wikimedia.org/r/792624 (https://phabricator.wikimedia.org/T304989) [14:26:17] (03CR) 10jerkins-bot: [V: 04-1] VRF element additions for cloudsw extention to row E/F [homer/public] - 10https://gerrit.wikimedia.org/r/792624 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [14:28:28] 10SRE, 10SRE-Access-Requests, 10Machine-Learning-Team (Active Tasks): Requesting access to the deployment POSIX group for aikochou and kevinbazira - https://phabricator.wikimedia.org/T308308 (10elukey) [14:29:26] (03CR) 10BCornwall: cli: Add support for XDG Base Directory spec (031 comment) [software/puppet-compiler] - 
10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall) [14:30:20] 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10akosiaris) >>! In T306649#7934722, @elukey wrote: > I merged two changes for the ml-serve-eqiad cluster, and now the concerns ex... [14:30:45] (03PS11) 10Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - 10https://gerrit.wikimedia.org/r/787067 [14:32:46] (03CR) 10Ayounsi: [C: 03+2] Interface automation: fail on duplicate cable ID [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/789089 (owner: 10Ayounsi) [14:33:24] (03CR) 10jerkins-bot: [V: 04-1] prometheus::blackbox::check: add new blackbox exporter check [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [14:33:35] (03CR) 10Majavah: openstack: Make enc api enforce keystone policy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [14:33:38] (03Merged) 10jenkins-bot: Interface automation: fail on duplicate cable ID [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/789089 (owner: 10Ayounsi) [14:34:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P27864 and previous config saved to /var/cache/conftool/dbconfig/20220517-143441-ladsgroup.json [14:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:54] (03PS2) 10Cathal Mooney: VRF element additions for cloudsw extention to row E/F [homer/public] - 10https://gerrit.wikimedia.org/r/792624 (https://phabricator.wikimedia.org/T304989) [14:34:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [14:34:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
6:00:00 on db1164.eqiad.wmnet with reason: Maintenance [14:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:04] (03CR) 10Andrew Bogott: [C: 04-1] "I'm ignoring this pending a more coherent plan about how to host Horizon generally. LMK if I've misunderstood and this is relevant to some" [puppet] - 10https://gerrit.wikimedia.org/r/781950 (https://phabricator.wikimedia.org/T305453) (owner: 10Majavah) [14:35:23] (03CR) 10jerkins-bot: [V: 04-1] VRF element additions for cloudsw extention to row E/F [homer/public] - 10https://gerrit.wikimedia.org/r/792624 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [14:35:42] (03CR) 10Andrew Bogott: [C: 03+2] delete expired ldap-labs certificates [puppet] - 10https://gerrit.wikimedia.org/r/791674 (owner: 10Dzahn) [14:37:50] 10SRE, 10SRE-Access-Requests, 10Machine-Learning-Team (Active Tasks): Requesting access to the deployment POSIX group for aikochou and kevinbazira - https://phabricator.wikimedia.org/T308308 (10elukey) [14:39:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T300774)', diff saved to https://phabricator.wikimedia.org/P27865 and previous config saved to /var/cache/conftool/dbconfig/20220517-143916-ladsgroup.json [14:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:22] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [14:40:33] (03CR) 10Andrew Bogott: [C: 03+1] openstack: Make enc api enforce keystone policy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [14:41:11] (03PS3) 10Cathal Mooney: VRF element additions for cloudsw extention to row E/F [homer/public] - 10https://gerrit.wikimedia.org/r/792624 (https://phabricator.wikimedia.org/T304989) [14:41:20] (03PS1) 10PipelineBot: mathoid: 
pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/792625 [14:42:09] 10SRE, 10Traffic, 10Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (10ssingh) [14:44:15] (03CR) 10Jbond: prometheus::blackbox::check: add new blackbox exporter check (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [14:44:31] (03CR) 10Muehlenhoff: "Both have signed a volunteer NDA after leaving, but if they are completely inactive at this point, we should also drop the rest of their a" [puppet] - 10https://gerrit.wikimedia.org/r/792612 (owner: 10Andrew Bogott) [14:45:05] (03PS12) 10Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - 10https://gerrit.wikimedia.org/r/787067 [14:45:16] (03PS4) 10Cathal Mooney: VRF element additions for cloudsw extention to row E/F [homer/public] - 10https://gerrit.wikimedia.org/r/792624 (https://phabricator.wikimedia.org/T304989) [14:47:21] (03CR) 10jerkins-bot: [V: 04-1] prometheus::blackbox::check: add new blackbox exporter check [puppet] - 10https://gerrit.wikimedia.org/r/787067 (owner: 10Jbond) [14:49:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T303603)', diff saved to https://phabricator.wikimedia.org/P27867 and previous config saved to /var/cache/conftool/dbconfig/20220517-144946-ladsgroup.json [14:49:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:49:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [14:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance 
[14:49:51] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [14:49:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T303603)', diff saved to https://phabricator.wikimedia.org/P27868 and previous config saved to /var/cache/conftool/dbconfig/20220517-144959-ladsgroup.json [14:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:25] (03CR) 10Physikerwelt: "could you paste a link to the fixed histograms if deployed?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/792625 (owner: 10PipelineBot) [14:53:03] (03CR) 10Andrew Bogott: [C: 03+2] icinga: remove creds for a couple of departed WMCS SREs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792612 (owner: 10Andrew Bogott) [14:53:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks! Merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/792609 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:53:27] (03CR) 10Muehlenhoff: [C: 03+2] visualdiff: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792609 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:54:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T303603)', diff saved to https://phabricator.wikimedia.org/P27869 and previous config saved to /var/cache/conftool/dbconfig/20220517-145406-ladsgroup.json [14:54:10] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:34] (03PS13) 10Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - 10https://gerrit.wikimedia.org/r/787067 [14:56:36] (03CR) 10Muehlenhoff: [C: 04-1] "There's currently one remaining email address which isn't @wikimedia.org or found in the task description of https://phabricator.wikimedia" [puppet] - 10https://gerrit.wikimedia.org/r/792608 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:57:46] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:59:09] (03CR) 10Muehlenhoff: icinga: remove creds for a couple of departed WMCS SREs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792612 (owner: 10Andrew Bogott) [14:59:57] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [15:00:59] (03CR) 10Muehlenhoff: "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/792607 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [15:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:03:22] (03PS2) 10JMeybohm: Remove null creationTimestamp from CRDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/792267 (https://phabricator.wikimedia.org/T306165) [15:03:24] (03PS6) 10JMeybohm: Replace kubeyaml with kubeconform (if available) [deployment-charts] - 10https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165) [15:03:26] (03PS4) 10JMeybohm: Add a rake task to generate JSON schema for chart CRDs on the fly [deployment-charts] - 10https://gerrit.wikimedia.org/r/792577 (https://phabricator.wikimedia.org/T306165) [15:04:56] (03PS4) 10Jbond: redfish: update signature of requests method with kwargs [software/spicerack] - 10https://gerrit.wikimedia.org/r/792595 [15:05:01] (03CR) 10Jbond: [C: 03+2] redfish: update signature of requests method with kwargs [software/spicerack] - 10https://gerrit.wikimedia.org/r/792595 (owner: 10Jbond) [15:07:33] (03CR) 10Jbond: [C: 04-1] cli: Add support for XDG Base Directory spec (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall) [15:09:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27870 and previous config saved to /var/cache/conftool/dbconfig/20220517-150911-ladsgroup.json [15:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:17] (03PS1) 10Jbond: Revert "delete expired ldap-labs certificates" [puppet] - 10https://gerrit.wikimedia.org/r/792482 [15:11:24] (03Abandoned) 10Jbond: Revert "delete expired ldap-labs certificates" [puppet] - 10https://gerrit.wikimedia.org/r/792482 (owner: 10Jbond) [15:12:48] (03Merged) 10jenkins-bot: redfish: update signature of requests method with kwargs 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/792595 (owner: 10Jbond) [15:13:18] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:13:31] (03PS1) 10BCornwall: "admin: Re-add user "brett" to ops group"" [puppet] - 10https://gerrit.wikimedia.org/r/792483 [15:13:41] (03CR) 10Filippo Giunchedi: [C: 04-1] "Not ready yet" [puppet] - 10https://gerrit.wikimedia.org/r/792568 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [15:14:20] (03PS2) 10BCornwall: "admin: Re-add user "brett" to ops group"" [puppet] - 10https://gerrit.wikimedia.org/r/792483 [15:15:20] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:15:21] (03CR) 10jerkins-bot: [V: 04-1] "admin: Re-add user "brett" to ops group"" [puppet] - 10https://gerrit.wikimedia.org/r/792483 (owner: 10BCornwall) [15:16:54] (03PS1) 10Ssingh: durum: return the site/DC in the check response [puppet] - 10https://gerrit.wikimedia.org/r/792635 [15:17:28] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.324 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:17:44] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48108 bytes in 0.219 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:17:53] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35309/console" [puppet] - 10https://gerrit.wikimedia.org/r/792635 (owner: 10Ssingh) [15:20:38] (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: return the site/DC in the check response [puppet] - 10https://gerrit.wikimedia.org/r/792635 (owner: 10Ssingh) [15:22:46] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:24:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27871 and previous config saved to /var/cache/conftool/dbconfig/20220517-152416-ladsgroup.json [15:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:41] (03CR) 10Andrew Bogott: [C: 03+2] icinga: remove creds for a couple of departed WMCS SREs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792612 (owner: 10Andrew Bogott) [15:32:59] (03PS1) 10Ssingh: durum: set site to null when Wikidough is not enabled [puppet] - 10https://gerrit.wikimedia.org/r/792638 [15:33:51] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35310/console" [puppet] - 10https://gerrit.wikimedia.org/r/792638 (owner: 10Ssingh) [15:34:33] (03CR) 10Jdlrobson: [C: 03+1] Deploy TOC A/B test to pilot wikis except frwiki, ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792272 (https://phabricator.wikimedia.org/T306607) (owner: 10Clare Ming) [15:34:36] (03PS4) 10Jdlrobson: Deploy TOC A/B test to pilot wikis except frwiki, ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792272 (https://phabricator.wikimedia.org/T306607) (owner: 10Clare Ming) [15:36:49] (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: set site to null when Wikidough is not enabled [puppet] - 10https://gerrit.wikimedia.org/r/792638 (owner: 10Ssingh) [15:38:02] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:39:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T303603)', diff saved to https://phabricator.wikimedia.org/P27872 and previous config saved to 
/var/cache/conftool/dbconfig/20220517-153921-ladsgroup.json [15:39:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [15:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [15:39:27] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [15:39:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [15:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:32] (03PS3) 10Ssingh: "admin: Re-add user "brett" to ops group"" [puppet] - 10https://gerrit.wikimedia.org/r/792483 (owner: 10BCornwall) [15:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [15:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:40:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [15:43:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db1130.eqiad.wmnet with reason: Maintenance [15:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T303603)', diff saved to https://phabricator.wikimedia.org/P27873 and previous config saved to /var/cache/conftool/dbconfig/20220517-154310-ladsgroup.json [15:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:46] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic: Betacommons: 504, Connection Timed Out at 2022-05-02 13:35:16 GMT - https://phabricator.wikimedia.org/T307354 (10AlexisJazz) Right now it works, as usual with these it was a transient error. [15:45:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T303603)', diff saved to https://phabricator.wikimedia.org/P27874 and previous config saved to /var/cache/conftool/dbconfig/20220517-154502-ladsgroup.json [15:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:08] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [15:45:54] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10KartikMistry) [15:52:01] RECOVERY - Disk space on ms-be1040 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1040&var-datasource=eqiad+prometheus/ops [15:57:35] RECOVERY - Check systemd state on ms-be1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:00:05] jbond and rzl: Your horoscope predicts another unfortunate Puppet request window deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220517T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:22:42] (03CR) 10Btullis: [C: 03+1] Move Hadoop eventlogs cleanup to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792116 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [16:27:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [16:27:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance [16:27:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T298555)', diff saved to https://phabricator.wikimedia.org/P27875 and previous config saved to /var/cache/conftool/dbconfig/20220517-162738-ladsgroup.json [16:27:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:45] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [16:28:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Manual repool', diff saved to https://phabricator.wikimedia.org/P27876 and previous config saved to /var/cache/conftool/dbconfig/20220517-162835-ladsgroup.json [16:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [16:30:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [16:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:24] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Depooling db1096:3315 (T303603)', diff saved to https://phabricator.wikimedia.org/P27877 and previous config saved to /var/cache/conftool/dbconfig/20220517-163024-ladsgroup.json [16:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:30] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [16:34:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T303603)', diff saved to https://phabricator.wikimedia.org/P27878 and previous config saved to /var/cache/conftool/dbconfig/20220517-163446-ladsgroup.json [16:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:00] (03PS1) 10Jbond: hiera_export: add unmanaged (mostly) network devices [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/792644 [16:48:55] (03CR) 10Dzahn: [C: 03+1] "looks good to me. thanks! (random comment: I always read this as "lion update". 
Can't help it:)" [puppet] - 10https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [16:49:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P27880 and previous config saved to /var/cache/conftool/dbconfig/20220517-164951-ladsgroup.json [16:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P27881 and previous config saved to /var/cache/conftool/dbconfig/20220517-170456-ladsgroup.json [17:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:26] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10MoritzMuehlenhoff) [17:07:59] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10MoritzMuehlenhoff) ganeti4003 is from the same batch and needs the same updates. I've migrated instances, removed it from the cluster for the reimage and downtimed it. 
[17:08:15] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti4003.ulsfo.wmnet with reason: Remove from cluster for eventual reimage [17:08:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti4003.ulsfo.wmnet with reason: Remove from cluster for eventual reimage [17:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:32] (03PS1) 10David Caro: wmcs-image-create: Remove puppet cron on the template image [puppet] - 10https://gerrit.wikimedia.org/r/792669 [17:12:09] (03PS1) 10Muehlenhoff: Enable ganeti4004 as Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/792670 [17:13:01] (03PS2) 10David Caro: wmcs-image-create: Remove puppet cron on the template image [puppet] - 10https://gerrit.wikimedia.org/r/792669 [17:16:27] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10bcampbell) Hey @Dzahn my apologies for the delay. I just completed the first two steps: - ITS conf... [17:20:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T303603)', diff saved to https://phabricator.wikimedia.org/P27882 and previous config saved to /var/cache/conftool/dbconfig/20220517-172001-ladsgroup.json [17:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:07] T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603 [17:24:00] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for turnilo/superset staging on Bullseye - https://phabricator.wikimedia.org/T306213 (10razzi) I'm going to go ahead and put this on row A. 
Here's a little snippet I used to look at the ganeti resource totals by row (`python -m pip instal... [17:25:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [17:25:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [17:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T300774)', diff saved to https://phabricator.wikimedia.org/P27883 and previous config saved to /var/cache/conftool/dbconfig/20220517-172521-ladsgroup.json [17:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:29] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [17:26:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T300774)', diff saved to https://phabricator.wikimedia.org/P27884 and previous config saved to /var/cache/conftool/dbconfig/20220517-172632-ladsgroup.json [17:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:37] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [17:28:41] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Swift [17:30:24] (03PS1) 10Ssingh: durum: display the DC the user is connected to in the frontend [puppet] - 10https://gerrit.wikimedia.org/r/792676 [17:31:09] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35311/console" [puppet] - 10https://gerrit.wikimedia.org/r/792676 (owner: 10Ssingh) [17:34:09] 
(03PS2) 10Jforrester: TimedMediaHandler: Disabled the BetaFeature from wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788385 (https://phabricator.wikimedia.org/T248418) [17:38:43] RECOVERY - Host an-tool1005 is UP: PING OK - Packet loss = 0%, RTA = 1.66 ms [17:43:27] PROBLEM - SSH on an-tool1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:47:13] RECOVERY - SSH on an-tool1005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:53:07] PROBLEM - Host an-tool1005 is DOWN: PING CRITICAL - Packet loss = 100% [17:54:00] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:55:47] RECOVERY - Host an-tool1005 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms [17:58:09] !log razzi@cumin1001 START - Cookbook sre.ganeti.makevm for new host an-tool1011.eqiad.wmnet [17:58:10] !log razzi@cumin1001 START - Cookbook sre.dns.netbox [17:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:11] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:02:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wqds101[4,5,6] - https://phabricator.wikimedia.org/T307138 (10Jclark-ctr) [18:04:28] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10Jclark-ctr) [18:04:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdev1003 - 
https://phabricator.wikimedia.org/T306935 (10Jclark-ctr) [18:06:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudvirt105[123].eqiad.wmnet - https://phabricator.wikimedia.org/T305194 (10Jclark-ctr) [18:08:41] (03PS1) 10Razzi: dhcpd: make an-tool1005 use debian 10 [puppet] - 10https://gerrit.wikimedia.org/r/792686 (https://phabricator.wikimedia.org/T308597) [18:08:45] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10Jclark-ctr) [18:09:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10Jclark-ctr) [18:13:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Jclark-ctr) [18:15:25] (03CR) 10Razzi: [C: 03+2] dhcpd: make an-tool1005 use debian 10 [puppet] - 10https://gerrit.wikimedia.org/r/792686 (https://phabricator.wikimedia.org/T308597) (owner: 10Razzi) [18:16:58] !log razzi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:17:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:09] PROBLEM - Host an-tool1005 is DOWN: PING CRITICAL - Packet loss = 100% [18:22:07] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:26:15] RECOVERY - Host an-tool1005 is UP: PING OK - Packet loss = 0%, RTA = 1.53 ms [18:26:37] !log razzi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host an-tool1011.eqiad.wmnet [18:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:27] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.24: 
Connection reset by peer https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:40:38] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:43:16] PROBLEM - MariaDB Replica Lag: s2 #page on db1156 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 21585.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:43:35] here [18:43:38] that's me [18:43:40] that is a big gap, is that a depooled host under maintenance? [18:43:42] uh? [18:43:44] the depool time wasn't enough [18:43:53] the downtime [18:43:55] downtime? [18:43:58] ah, cool [18:43:58] ah ok [18:43:59] yeah [18:44:00] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10RobH) updates all done, system is back up for reimage whenever [18:44:05] so it is depooled, right? [18:44:12] Don't scare people like that :-) [18:44:26] yes, it is depooled [18:44:27] 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10RobH) a:05RobH→03MoritzMuehlenhoff [18:44:38] my bad [18:44:46] resolved [18:45:02] let me downtime it for two more hours [18:45:20] for people on call, https://grafana.wikimedia.org/d/000000278/mysql-aggregated and https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard are good dashboards to double check no impact [18:45:20] Amir1: give it 4 just in case XD [18:45:47] (03CR) 10Zabe: wikilabels: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792608 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [18:46:00] hey [18:46:32] oh ok, resolved already :) [18:46:41] bblack: yup!
[18:46:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1156.eqiad.wmnet with reason: Maint [18:46:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1156.eqiad.wmnet with reason: Maint [18:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:58] downtimed for five more hours ^ [18:47:10] if nobody has acked yet on VO I suggest to do that though [18:47:21] I did [18:48:08] thx [18:48:39] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:51:40] (03PS1) 10Zabe: varnishkafka: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792692 (https://phabricator.wikimedia.org/T308013) [18:54:20] sorry, alter table on revision table takes long but more than six hours on s2? That was a bit unexpected [18:55:36] Amir1: assuming you are working now (ignore me if not) there is some weird pattern for uncached traffic since a few hours ago [18:55:39] (03PS1) 10Zabe: vagrant: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792693 (https://phabricator.wikimedia.org/T308013) [18:55:53] some of it must be just the alter (extra db writes) [18:55:53] jynus: where is it? [18:56:04] but some may not be explained by it [18:56:10] I'm cleaning up x1 as well [18:56:13] (03CR) 10jerkins-bot: [V: 04-1] vagrant: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792693 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [18:56:25] Amir1: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard [18:57:09] let me set the time so it is clearer [18:57:23] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1652209031443&to=1652813831443 [18:57:35] which panel?
[18:57:56] a few- first regular 200 get requests [18:58:10] which wouldn't be too worrying as that would be just traffic-related [18:58:26] and the db ones would be explained by schema changes [18:58:31] but see mcrouter [18:58:37] I don't know the dip but the pattern looks normal-ish [18:58:52] that means a performance issue- more parsings than usual [18:59:05] (03PS1) 10Zabe: vagrant: add shebang to alias-vagrant-profile-d.sh [puppet] - 10https://gerrit.wikimedia.org/r/792694 [18:59:06] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1652209031443&to=1652813831443 [18:59:22] seems back to normal now [18:59:25] I think there is someone parsing stuff with 100 req/s [18:59:34] yeah, that would explain it [18:59:49] as long as it is external-triggered no issue [19:00:04] (03CR) 10Zabe: vagrant: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792693 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [19:00:52] will keep an eye on that tomorrow [19:00:55] leaving for now [19:00:59] have fun! [19:01:12] as the effect is only a slight perf increase, nothing too crazy [19:01:36] well, perf decrease, latency increase :-) [19:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:01:58] have a nice day!
[19:07:32] (03PS1) 10BryanDavis: toolhub: Bump container version to 2022-05-17-072641-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/792696 (https://phabricator.wikimedia.org/T303909) [19:07:33] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:08:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdev1003 - https://phabricator.wikimedia.org/T306935 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson frdb1005 c1. u3. port; 4 , 4 cableid# 2945 , 4042 frdev1003 c1 u4... [19:10:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdev1003 - https://phabricator.wikimedia.org/T306935 (10Jclark-ctr) [19:15:02] (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: display the DC the user is connected to in the frontend [puppet] - 10https://gerrit.wikimedia.org/r/792676 (owner: 10Ssingh) [19:18:58] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2022-05-17-072641-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/792696 (https://phabricator.wikimedia.org/T303909) (owner: 10BryanDavis) [19:21:47] (03PS1) 10Majavah: nrpe: add nrpe::script to only installs scripts to hosts with nrpe [puppet] - 10https://gerrit.wikimedia.org/r/792700 [19:22:23] (03PS1) 10Andrew Bogott: profile::wmcs::instance: create nrpe plugin directory [puppet] - 10https://gerrit.wikimedia.org/r/792701 [19:22:41] (03PS2) 10Majavah: nrpe: add nrpe::plugin to only installs scripts to hosts with nrpe [puppet] - 10https://gerrit.wikimedia.org/r/792700 [19:22:46] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [19:24:06] (03Merged) 10jenkins-bot: toolhub: Bump container 
version to 2022-05-17-072641-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/792696 (https://phabricator.wikimedia.org/T303909) (owner: 10BryanDavis) [19:25:08] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35312/console" [puppet] - 10https://gerrit.wikimedia.org/r/792700 (owner: 10Majavah) [19:25:58] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [19:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:58] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [19:27:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:04] (03CR) 10jerkins-bot: [V: 04-1] nrpe: add nrpe::plugin to only installs scripts to hosts with nrpe [puppet] - 10https://gerrit.wikimedia.org/r/792700 (owner: 10Majavah) [19:28:08] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [19:28:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:32] (03PS4) 10Ssingh: admin: Re-add user "brett" to ops group [puppet] - 10https://gerrit.wikimedia.org/r/792483 (owner: 10BCornwall) [19:28:57] (03PS3) 10Majavah: nrpe: add nrpe::plugin to only installs scripts to hosts with nrpe [puppet] - 10https://gerrit.wikimedia.org/r/792700 [19:29:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frlog1002 - https://phabricator.wikimedia.org/T306839 (10Jclark-ctr) frlog1002 C1 U37 port; 2 , 2 cableid# 23000047 , 23000061 [19:29:55] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:29:55] !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [19:29:59] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:42] (03CR) 10BCornwall: [C: 03+2] admin: Re-add user "brett" to ops group [puppet] - 10https://gerrit.wikimedia.org/r/792483 (owner: 10BCornwall) [19:31:52] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35313/console" [puppet] - 10https://gerrit.wikimedia.org/r/792700 (owner: 10Majavah) [19:32:15] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:32:52] (03CR) 10jerkins-bot: [V: 04-1] nrpe: add nrpe::plugin to only installs scripts to hosts with nrpe [puppet] - 10https://gerrit.wikimedia.org/r/792700 (owner: 10Majavah) [19:33:25] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [19:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:55] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [19:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:24] (03CR) 10Herron: "Nice! 
looks good, thanks for putting it together" [alerts] - 10https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [19:35:39] (03PS2) 10Andrew Bogott: profile::wmcs::instance: create nrpe plugin directory [puppet] - 10https://gerrit.wikimedia.org/r/792701 (https://phabricator.wikimedia.org/T308601) [19:35:54] (03PS4) 10Andrew Bogott: nrpe: add nrpe::plugin to only installs scripts to hosts with nrpe [puppet] - 10https://gerrit.wikimedia.org/r/792700 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [19:38:27] (03CR) 10Majavah: [C: 04-1] "this would need /usr/lib/nagios/plugins/ as well to be fully effective" [puppet] - 10https://gerrit.wikimedia.org/r/792701 (https://phabricator.wikimedia.org/T308601) (owner: 10Andrew Bogott) [19:38:38] (03PS5) 10Majavah: nrpe: add nrpe::plugin to only installs scripts to hosts with nrpe [puppet] - 10https://gerrit.wikimedia.org/r/792700 (https://phabricator.wikimedia.org/T308601) [19:38:40] (03PS1) 10Majavah: base::firewall: migrate to nrpe::plugin [puppet] - 10https://gerrit.wikimedia.org/r/792705 (https://phabricator.wikimedia.org/T308601) [19:40:11] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35315/console" [puppet] - 10https://gerrit.wikimedia.org/r/792705 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [19:40:24] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35314/console" [puppet] - 10https://gerrit.wikimedia.org/r/792700 (https://phabricator.wikimedia.org/T308601) (owner: 10Majavah) [19:41:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frlog1002 - https://phabricator.wikimedia.org/T306839 (10Jclark-ctr) [19:41:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frlog1002 - 
https://phabricator.wikimedia.org/T306839 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [19:41:52] 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10Jclark-ctr) [19:44:07] !log Updated Toolhub to 42072d, applied db migrations, and rebuilt search indexes [19:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:17] (03PS1) 10Ssingh: test_dns: update DNS/durum test to reflect changes in API [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/792706 [19:59:09] (03CR) 10Ssingh: [C: 03+2] test_dns: update DNS/durum test to reflect changes in API [software/knead-wikidough] - 10https://gerrit.wikimedia.org/r/792706 (owner: 10Ssingh) [20:00:05] RoanKattouw, Urbanecm, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220517T2000). [20:00:05] cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:35] i'm the only one so i'll deploy my own patch [20:01:03] and wait around for a few before closing window [20:01:19] (03PS1) 10Ladsgroup: mariadb: Promote db1163 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/792707 (https://phabricator.wikimedia.org/T301312) [20:01:34] (03CR) 10Clare Ming: [C: 03+2] Deploy TOC A/B test to pilot wikis except frwiki, ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792272 (https://phabricator.wikimedia.org/T306607) (owner: 10Clare Ming) [20:02:34] (03Merged) 10jenkins-bot: Deploy TOC A/B test to pilot wikis except frwiki, ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792272 (https://phabricator.wikimedia.org/T306607) (owner: 10Clare Ming) [20:03:05] (03PS1) 10Ladsgroup: db1118: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/792708 (https://phabricator.wikimedia.org/T301312) [20:04:19] (03PS1) 10Ladsgroup: wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/792709 (https://phabricator.wikimedia.org/T301312) [20:05:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:15] (03CR) 10Ladsgroup: [C: 04-2] wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/792709 (https://phabricator.wikimedia.org/T301312) (owner: 10Ladsgroup) [20:05:24] (03CR) 10Ladsgroup: [C: 04-2] db1118: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/792708 (https://phabricator.wikimedia.org/T301312) (owner: 10Ladsgroup) [20:05:31] (03CR) 10Ladsgroup: [C: 04-2] mariadb: Promote db1163 to s1 master [puppet] - 10https://gerrit.wikimedia.org/r/792707 (https://phabricator.wikimedia.org/T301312) (owner: 10Ladsgroup) [20:06:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:06:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:06:04] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:06:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:06:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:26] (03PS1) 10Stang: betawikiversity: HIDPI support for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792710 (https://phabricator.wikimedia.org/T308604) [20:08:46] Hi cjming, is it still ok to deploy? [20:09:17] hi koi: sure - i'm just finishing up my patch [20:09:27] will do yours here shortly [20:09:35] ack and thanks [20:10:15] (03PS3) 10Andrew Bogott: profile::wmcs::instance: create nrpe plugin directory [puppet] - 10https://gerrit.wikimedia.org/r/792701 (https://phabricator.wikimedia.org/T308601) [20:10:57] (03PS1) 10Ssingh: durum: update check.js site names [puppet] - 10https://gerrit.wikimedia.org/r/792711 [20:11:46] !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:792272|Deploy TOC A/B test to pilot wikis except frwiki, ptwiki (T306607)]] (duration: 00m 53s) [20:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:51] T306607: Deploy ToC A/B test to remainder of desktop improvements pilot wikis - https://phabricator.wikimedia.org/T306607 [20:12:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:25] (03PS2) 10Clare Ming: betawikiversity: HIDPI support for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792710 (https://phabricator.wikimedia.org/T308604) (owner: 10Stang) [20:13:29] (03CR) 10Clare Ming: [C: 03+2] betawikiversity: HIDPI support for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792710 (https://phabricator.wikimedia.org/T308604) (owner: 10Stang) [20:13:39] 
(03CR) 10Ssingh: [C: 03+2] durum: update check.js site names [puppet] - 10https://gerrit.wikimedia.org/r/792711 (owner: 10Ssingh) [20:14:19] (03Merged) 10jenkins-bot: betawikiversity: HIDPI support for logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792710 (https://phabricator.wikimedia.org/T308604) (owner: 10Stang) [20:15:49] koi: can you check changes on mwdebug1001? [20:16:12] looking [20:17:09] LGTM [20:17:18] great - syncing now [20:18:24] (03PS1) 10Razzi: site: add an-tool1011 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/792712 (https://phabricator.wikimedia.org/T308597) [20:18:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:18:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:18:50] !log cjming@deploy1002 Synchronized static/images/project-logos/betawikiversity.png: Config: [[gerrit:792710|betawikiversity: HIDPI support for logo (T308604)]] (duration: 00m 54s) [20:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:00] T308604: Optimize Logo of Beta Wikiversity - https://phabricator.wikimedia.org/T308604 [20:19:50] !log cjming@deploy1002 Synchronized static/images/project-logos/betawikiversity-1.5x.png: Config: [[gerrit:792710|betawikiversity: HIDPI support for logo (T308604)]] (duration: 00m 56s) [20:19:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:46] !log cjming@deploy1002 Synchronized static/images/project-logos/betawikiversity-2x.png: Config: [[gerrit:792710|betawikiversity: HIDPI support for logo (T308604)]] (duration: 00m 53s) [20:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:41] !log cjming@deploy1002 Synchronized logos/config.yaml: Config: 
[[gerrit:792710|betawikiversity: HIDPI support for logo (T308604)]] (duration: 00m 52s) [20:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:41] !log cjming@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:792710|betawikiversity: HIDPI support for logo (T308604)]] (duration: 00m 53s) [20:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:52] koi: your changes should be live [20:22:59] thanks! [20:23:03] np! [20:25:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:33] !log end of UTC late backport & config window [20:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:19] (03CR) 10Razzi: [C: 03+2] site: add an-tool1011 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/792712 (https://phabricator.wikimedia.org/T308597) (owner: 10Razzi) [20:30:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:31:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:00] (03CR) 10Dzahn: [V: 03+2 C: 03+2] "spot checked a couple of the certs. looks good. 
usually people don't even create "real fake" certs and just put "placeholder" or "snake oi" [labs/private] - 10https://gerrit.wikimedia.org/r/791667 (https://phabricator.wikimedia.org/T307798) (owner: 10Eevans) [20:34:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:17] (03PS1) 10Razzi: install_server: add an-tool1011 as virtual [puppet] - 10https://gerrit.wikimedia.org/r/792718 (https://phabricator.wikimedia.org/T308597) [20:36:49] 10SRE, 10LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (10Dmantena) @RLazarus Sorry for re-opening this task, but while it appears I have Superset access, it doesn't appear I have SQL/Presto access to be able to view the analytics data I was after. Here'... [20:37:50] 10SRE, 10LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (10RhinosF1) 05Resolved→03Open [20:39:31] (03PS1) 10Bking: elastic: add reimage to rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) [20:39:51] (03CR) 10Razzi: [C: 03+2] install_server: add an-tool1011 as virtual [puppet] - 10https://gerrit.wikimedia.org/r/792718 (https://phabricator.wikimedia.org/T308597) (owner: 10Razzi) [20:40:24] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Krinkle) [20:41:07] 10SRE, 10LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (10RhinosF1) I've left a message with Analytics to check but based on https://wikitech.wikimedia.org/wiki/Analytics/Data_access#What_access_should_I_request?, I think this may need shell access / a p... 
[20:42:04] (03CR) 10jerkins-bot: [V: 04-1] elastic: add reimage to rolling-operation [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking) [20:43:43] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:45:20] 10SRE, 10LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (10Milimetric) Indeed, RhinosF1 is right, take a look at that link and I believe you need analytics-privatedata-users to run queries and access Presto-backed dashboards [20:50:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298560)', diff saved to https://phabricator.wikimedia.org/P27888 and previous config saved to /var/cache/conftool/dbconfig/20220517-205030-ladsgroup.json [20:50:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:36] 10SRE, 10LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (10RhinosF1) 05Open→03Resolved @DMantena: Can you file a new task using https://phabricator.wikimedia.org/maniphest/task/edit/form/8/ or copy the information from that form into this task? A bit... [20:50:37] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [20:52:06] RECOVERY - MariaDB Replica Lag: s2 #page on db1156 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:53:07] (03CR) 10Andrew Bogott: [C: 03+2] "This is a perfectly reasonable refactor -- I'm going to merge it right now so that I can add another line on top." 
[puppet] - 10https://gerrit.wikimedia.org/r/792669 (owner: 10David Caro) [20:57:21] (03PS1) 10Ebernhardson: Resolve minimum_should_match warnings during random scoring [extensions/CirrusSearch] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792649 (https://phabricator.wikimedia.org/T288765) [20:57:38] (03PS1) 10Ebernhardson: haslicense: Apply minimum_should_match for elastic 7.x [extensions/WikibaseCirrusSearch] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792650 (https://phabricator.wikimedia.org/T288765) [20:59:33] PROBLEM - Check size of conntrack table on an-tool1005 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.36.117: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:00:29] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:01:13] PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.36.117: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:01:51] RECOVERY - Check size of conntrack table on an-tool1005 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [21:02:13] RECOVERY - puppet last run on an-tool1005 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [21:03:19] (03PS1) 10Andrew Bogott: wmcs-image-create.py: Inject a couple of nagios plugin dirs into our image [puppet] - 10https://gerrit.wikimedia.org/r/792721 (https://phabricator.wikimedia.org/T308601) [21:03:31] RECOVERY - Check systemd state on an-tool1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:04:17] (03CR) 10jerkins-bot: [V: 04-1] wmcs-image-create.py: Inject a couple of nagios plugin dirs into 
our image [puppet] - 10https://gerrit.wikimedia.org/r/792721 (https://phabricator.wikimedia.org/T308601) (owner: 10Andrew Bogott) [21:05:13] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10Dzahn) 05Stalled→03Open [21:05:19] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (10Dzahn) [21:05:29] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (10Dzahn) [21:05:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P27889 and previous config saved to /var/cache/conftool/dbconfig/20220517-210535-ladsgroup.json [21:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:39] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10Dzahn) 05Open→03In progress [21:09:55] (03PS2) 10Andrew Bogott: wmcs-image-create.py: Inject a couple of nagios plugin dirs into our image [puppet] - 10https://gerrit.wikimedia.org/r/792721 (https://phabricator.wikimedia.org/T308601) [21:10:01] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:14:50] (03CR) 10Volans: "generic comment inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: 10Bking) [21:20:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P27890 and previous config saved to 
/var/cache/conftool/dbconfig/20220517-212040-ladsgroup.json [21:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P27891 and previous config saved to /var/cache/conftool/dbconfig/20220517-212316-ladsgroup.json [21:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [21:25:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [21:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T300774)', diff saved to https://phabricator.wikimedia.org/P27892 and previous config saved to /var/cache/conftool/dbconfig/20220517-212530-ladsgroup.json [21:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:36] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [21:27:53] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10Dzahn) Hi @bcampbell I removed the donate@ alias from the mail servers right now. I can confirm it n... [21:28:53] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10Dzahn) How about "donation@" as opposed to "donate@". Is that an alias for fundraising for for donat... 
[21:33:03] RECOVERY - Check systemd state on an-tool1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:37:30] (03PS1) 10Razzi: turnilo: move staging instance to an-tool1011 [puppet] - 10https://gerrit.wikimedia.org/r/792724 (https://phabricator.wikimedia.org/T308597) [21:37:51] (03CR) 10jerkins-bot: [V: 04-1] turnilo: move staging instance to an-tool1011 [puppet] - 10https://gerrit.wikimedia.org/r/792724 (https://phabricator.wikimedia.org/T308597) (owner: 10Razzi) [21:38:21] (03PS2) 10Razzi: turnilo: move staging instance to an-tool1011 [puppet] - 10https://gerrit.wikimedia.org/r/792724 (https://phabricator.wikimedia.org/T308597) [21:38:57] (03CR) 10jerkins-bot: [V: 04-1] turnilo: move staging instance to an-tool1011 [puppet] - 10https://gerrit.wikimedia.org/r/792724 (https://phabricator.wikimedia.org/T308597) (owner: 10Razzi) [21:43:28] 10SRE, 10LDAP-Access-Requests: Grant Access to `wmf` for Tsevener - https://phabricator.wikimedia.org/T308616 (10Tsevener) [21:43:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T300774)', diff saved to https://phabricator.wikimedia.org/P27893 and previous config saved to /var/cache/conftool/dbconfig/20220517-214349-ladsgroup.json [21:43:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:43:55] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [21:44:15] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 111 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:46:33] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 31 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:46:56] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10bcampbell) I've sent test mail from a couple different addresses, one internal and one external, and... [21:48:17] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:52:22] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for turnilo/superset staging on Bullseye - https://phabricator.wikimedia.org/T306213 (10razzi) 05Open→03Resolved VM created. Work continues at https://phabricator.wikimedia.org/T308597 [21:52:24] !log alert1001 - systemctl start certspotter (after alert that the unit was failed. happens sometimes) [21:52:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:00] (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:54:19] 10SRE-tools, 10DNS, 10Infrastructure-Foundations, 10Traffic: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (10Volans) I've a local patch that I'm testing to perform the validation of the whole dataset (manual + netbox). The preliminary results are b... 
[21:56:13] (03PS3) 10Razzi: turnilo: move staging instance to an-tool1011 [puppet] - 10https://gerrit.wikimedia.org/r/792724 (https://phabricator.wikimedia.org/T308597) [21:58:34] (03CR) 10Volans: hiera_export: add unmanaged (mostly) network devices (032 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/792644 (owner: 10Jbond) [21:58:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P27894 and previous config saved to /var/cache/conftool/dbconfig/20220517-215854-ladsgroup.json [21:58:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:56] 10SRE-tools, 10Discovery, 10Discovery-Search, 10Infrastructure-Foundations, 10IPv6: Some elastic hosts do not have IPv6 DNS entries - https://phabricator.wikimedia.org/T271143 (10bking) a:03bking [22:01:13] 10SRE-tools, 10Discovery, 10Discovery-Search, 10Infrastructure-Foundations, 10IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (10bking) [22:03:20] 10SRE, 10ops-codfw: Recycling Pickup for CODFW - https://phabricator.wikimedia.org/T307694 (10Papaul) Pickup and on site shred complete . 
{F35150027} [22:04:43] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:05:56] jouncebot: nowandnext [22:05:56] No deployments scheduled for the next 8 hour(s) and 54 minute(s) [22:05:56] In 8 hour(s) and 54 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220518T0700) [22:07:17] * urbanecm stashing at debug servers [22:07:54] * urbanecm finished [22:08:21] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:08:59] (03CR) 10Brennen Bearnes: "This has been tested against an existing WMCS runner. Works as expected. Sample error message in failed pipeline:" [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [22:09:36] * urbanecm goes to deploy a patch now [22:10:12] (03PS3) 10Andrew Bogott: wmcs-image-create.py: Inject a couple of nagios plugin dirs into our image [puppet] - 10https://gerrit.wikimedia.org/r/792721 (https://phabricator.wikimedia.org/T308601) [22:10:39] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:12:54] (03PS1) 10Urbanecm: langlist: add kcg language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792731 (https://phabricator.wikimedia.org/T305279) [22:12:56] (03CR) 10Urbanecm: [C: 03+2] langlist: add kcg language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792731 (https://phabricator.wikimedia.org/T305279) (owner: 10Urbanecm) [22:12:58] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792732 [22:13:01] (03CR) 10Urbanecm: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792732 (owner: 
10Urbanecm) [22:13:49] (03Merged) 10jenkins-bot: langlist: add kcg language [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792731 (https://phabricator.wikimedia.org/T305279) (owner: 10Urbanecm) [22:13:55] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792732 (owner: 10Urbanecm) [22:14:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P27895 and previous config saved to /var/cache/conftool/dbconfig/20220517-221359-ladsgroup.json [22:14:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:34] !log urbanecm@deploy1002 Synchronized langlist: cd704d4f: langlist: add kcg language (T305279) (duration: 00m 53s) [22:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:39] T305279: Create Wikipedia Tyap - https://phabricator.wikimedia.org/T305279 [22:16:27] !log urbanecm@deploy1002 Synchronized wmf-config/interwiki.php: c2151b3: Update interwiki cache (duration: 00m 52s) [22:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:16:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:21] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:17:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:17:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:25] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:14] * urbanecm done with deployment [22:19:27] (03CR) 10Eevans: [C: 04-1] WIP: enable cassandra encryption (aqs cluster) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/791663 (https://phabricator.wikimedia.org/T307798) (owner: 10Eevans) [22:19:57] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:22:45] (03PS4) 10Razzi: turnilo: move staging instance to an-tool1011 [puppet] - 10https://gerrit.wikimedia.org/r/792724 (https://phabricator.wikimedia.org/T308597) [22:22:50] 10SRE-tools, 10Discovery, 10Discovery-Search, 10Infrastructure-Foundations, 10IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (10bking) For clarity, client-side IPv6 connectivity to search functions in wikipedia, wikicommons, etc does not require the Elas... [22:23:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:23:33] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35320/console" [puppet] - 10https://gerrit.wikimedia.org/r/792724 (https://phabricator.wikimedia.org/T308597) (owner: 10Razzi) [22:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:46] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10Dzahn) The following aliases have all been removed on the SRE side now: donation@ donations@ donate@... 
[22:24:01] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:24:33] 10SRE-tools, 10Discovery, 10Discovery-Search, 10Infrastructure-Foundations, 10IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (10bking) 05Open→03Resolved [22:25:03] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10Dzahn) 05In progress→03Resolved [22:25:09] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (10Dzahn) [22:27:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:27:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T300774)', diff saved to https://phabricator.wikimedia.org/P27896 and previous config saved to /var/cache/conftool/dbconfig/20220517-222904-ladsgroup.json [22:29:08] (03PS1) 10Razzi: turnilo: change an-tool1011 to use bullseye [puppet] - 10https://gerrit.wikimedia.org/r/792733 (https://phabricator.wikimedia.org/T308597) [22:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:28] T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774 [22:29:45] (03CR) 10Razzi: [V: 03+1] "Now that turnilo requires Debian 11 and superset requires Debian 10, this patch moves turnilo to a newly created dedicated turnilo staging" [puppet] - 
10https://gerrit.wikimedia.org/r/792724 (https://phabricator.wikimedia.org/T308597) (owner: 10Razzi) [22:31:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:52] (03CR) 10Ahmon Dancy: [C: 03+1] gitlab runner: restrict docker images and services [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [22:31:54] 10SRE, 10ops-codfw, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (10Dzahn) @Papaul Yea, reimaging is no problem. It's still in "insetup" and I can do it. Pick the easier option for you. [22:32:29] urbanecm: Please follow https://wikitech.wikimedia.org/wiki/Deployments/Emergencies in future for deploys outside of the deploy windows. [22:38:25] (03CR) 10Brennen Bearnes: gitlab runner: restrict docker images and services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: 10Brennen Bearnes) [22:38:27] (03CR) 10Jforrester: "Will this need a change like 24a6e44a5bb3f9b30d13c9852577c3c0678bf62d too as you're switching to service-runner 3 from 2?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/792625 (owner: 10PipelineBot) [22:44:09] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:44:31] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:44:55] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:48:19] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:59:57] PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:01:57] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:02:07] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:05:43] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:16:47] PROBLEM - SSH on analytics1061.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:22:46] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:28:51] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:29:40] (03PS1) 10Jforrester: [shnwiki] Enable the SandboxLink extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792737 (https://phabricator.wikimedia.org/T308623) [23:53:59] RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state