[00:39:00] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:54:24] <icinga-wm>	 PROBLEM - Host an-tool1005 is DOWN: PING CRITICAL - Packet loss = 100%
[01:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220517T0100)
[01:05:22] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-config-backup-gitlab1003.wikimedia.org.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:06:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2022:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[01:35:36] <icinga-wm>	 PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:39:17] <wikibugs>	 10SRE-OnFire, 10SRE Observability (FY2021/2022-Q4): implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10lmata) p:05Triage→03Medium
[01:43:45] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:53:45] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:05:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:05:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:05:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:05:58] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:06:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:06:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:50] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/1.39.0-wmf.12 [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792302
[02:07:54] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.39.0-wmf.12 [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792302 (owner: 10TrainBranchBot)
[02:22:59] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/1.39.0-wmf.12 [core] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792302 (owner: 10TrainBranchBot)
[02:27:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:27:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:28:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:28:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:28:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:01:57] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:24:28] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:26:42] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.075 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:37:08] <icinga-wm>	 PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:37:42] <icinga-wm>	 PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:46:28] <icinga-wm>	 RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:04:43] <wikibugs>	 (03PS3) 10KartikMistry: Enable Section Translation in bcl, is, ne, pa, ts and ur Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791481 (https://phabricator.wikimedia.org/T304828)
[04:24:22] <icinga-wm>	 PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:32:08] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:34:04] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:38:54] <icinga-wm>	 RECOVERY - SSH on wtp1046.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:06:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2022:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:20:04] <icinga-wm>	 RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:23:00] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 69, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:23:20] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 45, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:44:56] <_joe_>	 !log restarted rsyslog on kubernetes2022
[05:45:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:58] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2022:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[05:54:00] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:04] <jouncebot>	 kormat, marostegui, and Amir1: Your horoscope predicts another unfortunate Primary database switchover deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220517T0600).
[06:12:33] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10JMeybohm) >>! In T306649#7931058, @akosiaris wrote: >> Regarding the "fake nodes": I think that could be done with adding the le...
[06:23:38] <wikibugs>	 10SRE, 10SRE-OnFire, 10conftool, 10Sustainability (Incident Followup): Invalid confctl selector should either error out or select nothing - https://phabricator.wikimedia.org/T308100 (10Joe) 05Open→03Resolved p:05Triage→03High
[06:25:43] <wikibugs>	 10SRE, 10conftool: requestctl v1 improvements - https://phabricator.wikimedia.org/T305580 (10Joe)
[06:25:46] <wikibugs>	 10SRE, 10conftool, 10Patch-For-Review: Provide a meaningful Retry-After value - https://phabricator.wikimedia.org/T305824 (10Joe) 05Open→03Resolved
[06:29:04] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10ayounsi) > Plus, they are VMs and we have the same problem we have with the kask dedicated nodes (also VMs). Netbox doesn't have...
[06:33:47] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] msw: use _get_junos_interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/792204 (owner: 10Ayounsi)
[06:34:35] <wikibugs>	 (03Merged) 10jenkins-bot: msw: use _get_junos_interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/792204 (owner: 10Ayounsi)
[06:37:57] <XioNoX>	 !log management switches, split configuration per interfaces (use new get_junos_interfaces function)
[06:38:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:26] <wikibugs>	 (03PS4) 10Slyngshede: Move l10nupdate to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673)
[06:41:47] <wikibugs>	 (03PS5) 10Slyngshede: Move l10nupdate to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673)
[06:42:06] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] mr: use _get_junos_interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/792205 (owner: 10Ayounsi)
[06:42:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Move l10nupdate to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[06:42:40] <wikibugs>	 (03Merged) 10jenkins-bot: mr: use _get_junos_interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/792205 (owner: 10Ayounsi)
[06:44:26] <wikibugs>	 (03PS6) 10Slyngshede: Move l10nupdate to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673)
[06:49:22] <XioNoX>	 !log management routers, split configuration per interfaces (use new get_junos_interfaces function)
[06:49:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: remove http availability pages, moved to prometheus (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790671 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[06:52:57] <wikibugs>	 (03PS2) 10WMDE-Fisch: Deploy VE template dialog improvements to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791314 (https://phabricator.wikimedia.org/T306967)
[06:53:04] <wikibugs>	 (03PS2) 10WMDE-Fisch: Deploy template search improvements to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791315 (https://phabricator.wikimedia.org/T303802)
[06:56:46] <wikibugs>	 (03PS7) 10Slyngshede: Move l10nupdate to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673)
[06:59:46] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35301/console" [puppet] - 10https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220517T0700).
[07:00:05] <jouncebot>	 WMDE-Fisch and kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:21] * kart_ is here
[07:00:23] <WMDE-Fisch>	 \o I can self serve
[07:00:43] * WMDE-Fisch starts
[07:00:43] <kart_>	 Cool. Please go ahead and let me know once done.
[07:00:51] * urbanecm waves too
[07:00:59] <urbanecm>	 But leaves WMDE-Fisch to self serve :))
[07:01:56] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:02:06] <wikibugs>	 (03CR) 10WMDE-Fisch: [C: 03+2] "Deploy!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791314 (https://phabricator.wikimedia.org/T306967) (owner: 10WMDE-Fisch)
[07:03:08] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy VE template dialog improvements to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791314 (https://phabricator.wikimedia.org/T306967) (owner: 10WMDE-Fisch)
[07:03:30] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "Fixed comments." [puppet] - 10https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[07:04:40] * WMDE-Fisch testing first patch on debug1001
[07:06:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:06:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:07:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:07:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:46] <logmsgbot>	 !log wmde-fisch@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:791314|Deploy VE template dialog improvements to enwiki (T306967)]] (duration: 00m 50s)
[07:07:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:50] <stashbot>	 T306967: Deploy VE template dialog improvements to enwiki - https://phabricator.wikimedia.org/T306967
[07:08:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:08:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:32] * WMDE-Fisch 1st patch seems fine... moving on
[07:11:48] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: requestctl: fix small issues with the VSL translations [software/conftool] - 10https://gerrit.wikimedia.org/r/792555
[07:11:50] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: New version [software/conftool] - 10https://gerrit.wikimedia.org/r/792556
[07:11:52] <wikibugs>	 (03CR) 10WMDE-Fisch: [C: 03+2] "Deploy!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791315 (https://phabricator.wikimedia.org/T303802) (owner: 10WMDE-Fisch)
[07:12:38] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy template search improvements to enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791315 (https://phabricator.wikimedia.org/T303802) (owner: 10WMDE-Fisch)
[07:12:50] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] cr: use _get_junos_interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/792208 (owner: 10Ayounsi)
[07:13:28] <wikibugs>	 (03Merged) 10jenkins-bot: cr: use _get_junos_interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/792208 (owner: 10Ayounsi)
[07:14:07] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/792232 (https://phabricator.wikimedia.org/T308418) (owner: 10Elukey)
[07:14:23] <WMDE-Fisch>	 Testing on debug1001
[07:16:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/792284 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[07:17:22] <XioNoX>	 !log core routers, split configuration per interfaces (use new get_junos_interfaces function)
[07:17:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:18:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:19:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:19:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:18] <wikibugs>	 (03PS1) 10Jaime Nuche: testwikis wikis to 1.39.0-wmf.12  refs T305218 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792557
[07:20:20] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+2] testwikis wikis to 1.39.0-wmf.12  refs T305218 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792557 (owner: 10Jaime Nuche)
[07:20:31] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[07:20:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, all commits with a @wikimedia.org address, I'll merge" [puppet] - 10https://gerrit.wikimedia.org/r/792282 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[07:20:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:20:46] <logmsgbot>	 !log wmde-fisch@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:791315|Deploy template search improvements to enwiki (T303802)]] (duration: 02m 11s)
[07:20:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:51] <wikibugs>	 (03PS6) 10Slyngshede: Move Carbon Cache log cleanup to systemd tmpfile. [puppet] - 10https://gerrit.wikimedia.org/r/792155 (https://phabricator.wikimedia.org/T273673)
[07:20:52] <stashbot>	 T303802: Deploy template search improvements to enwiki - https://phabricator.wikimedia.org/T303802
[07:20:59] <WMDE-Fisch>	 synced, final tests
[07:21:08] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.12  refs T305218 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792557 (owner: 10Jaime Nuche)
[07:21:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] zookeeper: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792282 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[07:22:01] <WMDE-Fisch>	 All good, I'm done!
[07:22:07] <logmsgbot>	 !log jnuche@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.12  refs T305218
[07:22:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:13] <stashbot>	 T305218: 1.39.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T305218
[07:22:26] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35302/console" [puppet] - 10https://gerrit.wikimedia.org/r/792155 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[07:22:49] <kart_>	 WMDE-Fisch: Thanks. I'll also self-deploy..
[07:23:00] <WMDE-Fisch>	 kart_: Great!
[07:23:21] <wikibugs>	 (03CR) 10KartikMistry: [C: 03+2] Enable Section Translation in bcl, is, ne, pa, ts and ur Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791481 (https://phabricator.wikimedia.org/T304828) (owner: 10KartikMistry)
[07:23:29] <wikibugs>	 (03PS4) 10KartikMistry: Enable Section Translation in bcl, is, ne, pa, ts and ur Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/791481 (https://phabricator.wikimedia.org/T304828)
[07:23:56] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "Switched patch from systemd timers to systemd tmpfile instead." [puppet] - 10https://gerrit.wikimedia.org/r/792155 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[07:25:14] <hashar>	 good morning!  Hi jnuche  :)
[07:25:15] <wikibugs>	 (03CR) 10Slyngshede: Update statistics::rsync::published to use SystemD timers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/789570 (https://phabricator.wikimedia.org/T123456) (owner: 10Slyngshede)
[07:25:43] <jnuche>	 morning hashar! 👋
[07:25:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:25:47] <hashar>	 I guess I missed that we run the train at 9am CEST
[07:25:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:44] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:26:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:26:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:49] <jnuche>	 just the prep as usual I thought, the first deploy to group0 will start after 10 cest as usual
[07:26:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:30] <hashar>	 for the risky patches, the new extension `SimilarEditors` should not cause any issue. It is merely shipping files that are not in use anywhere so people can "easily" turn on the extension whenever they are around
[07:27:55] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:27:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:09] <hashar>	 the other with database lazy connections, well I don't know. It sounds like every time we touch that area of code there is some surprising side effect spurting out :]
[07:28:30] <hashar>	 anyway they seem fine :]
[07:29:22] <kart_>	 'scap pull' on mwdebug1001 is taking longer than usual..
[07:29:31] <kart_>	 OK. now finished.
[07:30:07] <hashar>	 kart_: I think the longer scap pull is because the new mediawiki 1.39.0-wmf.12 is on the deploy server
[07:30:20] <hashar>	 so it takes 2/3 minutes to rsync all of mediawiki code + l10n cache 
[07:31:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/792253 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[07:32:55] <kart_>	 hashar: What's this? `07:32:36 sync-file failed: <LockFailedError> Failed to acquire lock "/var/lock/scap.operations_mediawiki-config.lock"; owner is "jnuche"; reason is "testwikis wikis to 1.39.0-wmf.12  refs T305218"`
[07:32:55] <stashbot>	 T305218: 1.39.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T305218
[07:33:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:33:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:33:32] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: fix node-exim-queue' stats on no matches [puppet] - 10https://gerrit.wikimedia.org/r/792558 (https://phabricator.wikimedia.org/T305847)
[07:33:49] <kart_>	 jnuche ^^ Our config deployment window is on.
[07:34:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:34:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:34:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:14] <jnuche>	 kart_: sorry, that must be the train staging that I'm running right now, I didn't know it could affect the other deployment window
[07:34:37] <kart_>	 Yes. We have a window for that reason :)
[07:34:57] <jnuche>	 I'll just cancel
[07:35:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:35:00] <logmsgbot>	 !log jnuche@deploy1002 deploy-promote aborted:  (duration: 14m 44s)
[07:35:01] <logmsgbot>	 !log jnuche@deploy1002 stage-train aborted:  (duration: 25m 33s)
[07:35:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:18] <jnuche>	 kart_: done, you should be able to continue now
[07:35:20] <jnuche>	 sorry about that
[07:35:32] <jnuche>	 I'll wait until you're done
[07:35:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: fix node-exim-queue' stats on no matches [puppet] - 10https://gerrit.wikimedia.org/r/792558 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[07:35:40] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: fix node-exim-queue' stats on no matches [puppet] - 10https://gerrit.wikimedia.org/r/792558 (https://phabricator.wikimedia.org/T305847)
[07:35:41] <kart_>	 jnuche: Thanks and no problem! I'll take only a few minutes..
[07:35:56] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10MoritzMuehlenhoff)
[07:36:39] <logmsgbot>	 !log kartik@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:791481|Enable Section Translation in bcl, is, ne, pa, ts and ur Wikipedias (T304828)]] (duration: 00m 53s)
[07:36:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:44] <stashbot>	 T304828: Enable Section Translation in 13 wikis where Content Translation is already available as default - https://phabricator.wikimedia.org/T304828
[07:36:56] <kart_>	 !log UTC morning backport window - Done.
[07:37:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:02] <kart_>	 jnuche: You can go ahead.
[07:37:34] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] Update statistics::publishd to use SystemD timers, rather than cron. [puppet] - 10https://gerrit.wikimedia.org/r/789599 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[07:37:36] <jnuche>	 kart_: thanks!
[07:38:34] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] commons: use _get_junos_interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/792210 (owner: 10Ayounsi)
[07:39:13] <wikibugs>	 (03Merged) 10jenkins-bot: commons: use _get_junos_interfaces [homer/public] - 10https://gerrit.wikimedia.org/r/792210 (owner: 10Ayounsi)
[07:39:22] <logmsgbot>	 !log jnuche@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.12  refs T305218
[07:39:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:27] <stashbot>	 T305218: 1.39.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T305218
[07:41:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/792178 (https://phabricator.wikimedia.org/T308013) (owner: 10Ladsgroup)
[07:43:48] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10MoritzMuehlenhoff) >>! In T308013#7931137, @jcrespo wrote: >> Apache 2 seems to be used by puppet and the puppet modules, it retains the copyright so that seems fine to me...
[07:45:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10MoritzMuehlenhoff)
[07:48:01] <wikibugs>	 (03PS1) 10KartikMistry: Enable Section Translation in as, gu, kn, mk and, mr Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792559 (https://phabricator.wikimedia.org/T304828)
[07:49:40] <wikibugs>	 (03PS1) 10Slyngshede: Redirection not available for systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792560
[07:51:56] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:53:11] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/792560 (owner: 10Slyngshede)
[07:53:28] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] Redirection not available for systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792560 (owner: 10Slyngshede)
[07:53:58] <logmsgbot>	 !log jnuche@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.12  refs T305218 (duration: 14m 35s)
[07:54:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:03] <stashbot>	 T305218: 1.39.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T305218
[07:54:36] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good. One other option: Since prometheus-labs-targets.py is already shipped by us via Puppet we could also simply add a new option t" [puppet] - https://gerrit.wikimedia.org/r/792185 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:56:11] <wikibugs>	 (CR) DCausse: [C: +1] rdf query service: Apply WARN log level only to com.bigdata [puppet] - https://gerrit.wikimedia.org/r/792266 (https://phabricator.wikimedia.org/T306899) (owner: Ebernhardson)
[07:56:15] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/792177 (https://phabricator.wikimedia.org/T308013) (owner: Ladsgroup)
[07:56:54] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/792175 (https://phabricator.wikimedia.org/T308013) (owner: Ladsgroup)
[07:57:43] <wikibugs>	 (PS2) Ladsgroup: orchestrator: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/792178 (https://phabricator.wikimedia.org/T308013)
[07:57:45] <wikibugs>	 (PS1) Ayounsi: evpn: use get_junos_interfaces [homer/public] - https://gerrit.wikimedia.org/r/792561
[07:58:17] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add SPDX headers to debdeploy/adduser/puppetboard modules [puppet] - https://gerrit.wikimedia.org/r/791596 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[07:58:30] <wikibugs>	 (CR) Ladsgroup: [C: +2] orchestrator: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/792178 (https://phabricator.wikimedia.org/T308013) (owner: Ladsgroup)
[07:59:18] <wikibugs>	 (PS2) Ladsgroup: dbtree: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/792175 (https://phabricator.wikimedia.org/T308013)
[08:00:05] <jouncebot>	 jnuche and hashar: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220517T0800).
[08:00:11] <wikibugs>	 (CR) Ladsgroup: [C: +2] dbtree: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/792175 (https://phabricator.wikimedia.org/T308013) (owner: Ladsgroup)
[08:00:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:00:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:28] <wikibugs>	 (PS2) Ladsgroup: proxysql: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/792177 (https://phabricator.wikimedia.org/T308013)
[08:00:31] <wikibugs>	 (CR) Ayounsi: [C: +2] evpn: use get_junos_interfaces [homer/public] - https://gerrit.wikimedia.org/r/792561 (owner: Ayounsi)
[08:01:14] <wikibugs>	 (Merged) jenkins-bot: evpn: use get_junos_interfaces [homer/public] - https://gerrit.wikimedia.org/r/792561 (owner: Ayounsi)
[08:01:14] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:01:16] <wikibugs>	 (CR) Ladsgroup: [C: +2] proxysql: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/792177 (https://phabricator.wikimedia.org/T308013) (owner: Ladsgroup)
[08:03:38] <wikibugs>	 (PS1) Jaime Nuche: group0 wikis to 1.39.0-wmf.12  refs T305218 [mediawiki-config] - https://gerrit.wikimedia.org/r/792563
[08:03:40] <wikibugs>	 (CR) Jaime Nuche: [C: +2] group0 wikis to 1.39.0-wmf.12  refs T305218 [mediawiki-config] - https://gerrit.wikimedia.org/r/792563 (owner: Jaime Nuche)
[08:04:21] <wikibugs>	 (PS2) Filippo Giunchedi: sre: port mediawiki php-fpm saturation alert [alerts] - https://gerrit.wikimedia.org/r/791356 (https://phabricator.wikimedia.org/T305847)
[08:04:23] <wikibugs>	 (PS1) Filippo Giunchedi: sre: port mx queue high page [alerts] - https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847)
[08:04:25] <wikibugs>	 (Merged) jenkins-bot: group0 wikis to 1.39.0-wmf.12  refs T305218 [mediawiki-config] - https://gerrit.wikimedia.org/r/792563 (owner: Jaime Nuche)
[08:05:34] <wikibugs>	 (CR) Filippo Giunchedi: "Note that we're not ready yet to merge this (not enough data in the metric IMHO), however I wanted to put it out there for your considerat" [alerts] - https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[08:05:58] <wikibugs>	 (CR) Jcrespo: "Shouldn't dbtree be removed from puppet instead?" [puppet] - https://gerrit.wikimedia.org/r/792175 (https://phabricator.wikimedia.org/T308013) (owner: Ladsgroup)
[08:06:02] <logmsgbot>	 !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.12  refs T305218
[08:06:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:08] <stashbot>	 T305218: 1.39.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T305218
[08:06:24] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/789570 (https://phabricator.wikimedia.org/T123456) (owner: Slyngshede)
[08:07:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:07:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:07:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:25] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/792155 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:08:36] <moritzm>	 !log installing ffmpeg security updates on stretch
[08:08:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:51] <wikibugs>	 (PS1) Ladsgroup: Turn on read new for templatelinks on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/792565 (https://phabricator.wikimedia.org/T306673)
[08:13:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:13:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:45] <wikibugs>	 (CR) Marostegui: [C: +2] admin: add Antoine Musso to Phabricator hosts [puppet] - https://gerrit.wikimedia.org/r/792270 (https://phabricator.wikimedia.org/T308478) (owner: Hashar)
[08:14:07] <wikibugs>	 (CR) Ladsgroup: [C: +2] dbtree: Add SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/792175 (https://phabricator.wikimedia.org/T308013) (owner: Ladsgroup)
[08:15:04] <wikibugs>	 SRE, SRE-Access-Requests, Phabricator, Release-Engineering-Team, Patch-For-Review: Add Antoine Musso to Phabricator hosts - https://phabricator.wikimedia.org/T308478 (Marostegui) Open→Resolved a:Marostegui merged the change and ran puppet on phab1001.eqiad.wmnet
[08:15:48] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (Marostegui) p:Triage→Medium
[08:16:03] <wikibugs>	 SRE, DBA, Wikimedia-Incident, Wikimedia-production-error: 2022-05-14 Databases - https://phabricator.wikimedia.org/T308380 (Marostegui) p:Triage→Medium
[08:16:15] <wikibugs>	 (CR) Volans: "addressed comments" [software/spicerack] - https://gerrit.wikimedia.org/r/775904 (owner: Volans)
[08:16:39] <wikibugs>	 (PS4) Volans: service: add new module to expose service::catalog [software/spicerack] - https://gerrit.wikimedia.org/r/775904
[08:17:07] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (wiki_willy) a:Jclark-ctr
[08:17:53] <wikibugs>	 SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (Marostegui) p:Triage→Medium
[08:17:57] <wikibugs>	 (CR) Slyngshede: [C: +2] Move automated target generation of Prometheus targets to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/792185 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:18:07] <wikibugs>	 SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (Marostegui) p:Triage→Medium
[08:18:17] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate the IDPs to Bullseye - https://phabricator.wikimedia.org/T308214 (Marostegui) p:Triage→Medium
[08:18:28] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:18:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:34] <wikibugs>	 SRE, SRE-swift-storage, User-fgiunchedi: swift-account-stats failures on thanos-swift - https://phabricator.wikimedia.org/T307907 (Marostegui) p:Triage→Medium
[08:18:51] <wikibugs>	 SRE, Traffic, Patch-For-Review: Implement SLI measurement for HAProxy - https://phabricator.wikimedia.org/T307898 (Marostegui) p:Triage→Medium
[08:19:11] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, serviceops, Patch-For-Review: contint/releases/hosts with helm installed: puppet - Could not find group deployment - https://phabricator.wikimedia.org/T307740 (Marostegui) p:Triage→Medium
[08:19:17] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Replace RAID controller battery in an-worker1081 - https://phabricator.wikimedia.org/T308434 (wiki_willy) Hi @Jclark-ctr - this one is out of warranty, but let me know if you have any spares around or if we should purchase one.  Thanks, Willy
[08:19:29] <wikibugs>	 SRE, RESTBase-API, Traffic, Documentation: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (Marostegui) p:Triage→Medium
[08:20:15] <wikibugs>	 SRE, Patch-For-Review, Wikimedia-Incident: Modernize etcd tlsproxy certificate management - https://phabricator.wikimedia.org/T307382 (Marostegui) p:Triage→Medium
[08:20:29] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Move cloud vps ns-recursor IPs to host/row-independent addressing - https://phabricator.wikimedia.org/T307357 (Marostegui) p:Triage→Medium
[08:20:43] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Traffic: Betacommons: 504, Connection Timed Out at 2022-05-02 13:35:16 GMT - https://phabricator.wikimedia.org/T307354 (Marostegui) p:Triage→Medium
[08:21:04] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics@b569ee8]: Update DAG spark conf [airflow-dags/analytics@b569ee8]
[08:21:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:11] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@b569ee8]: Update DAG spark conf [airflow-dags/analytics@b569ee8] (duration: 00m 07s)
[08:21:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:58] <wikibugs>	 (PS1) Ayounsi: switches: use get_junos_interfaces [homer/public] - https://gerrit.wikimedia.org/r/792566
[08:24:21] <wikibugs>	 (CR) Slyngshede: [C: +2] Update statistics::rsync::published to use SystemD timers [puppet] - https://gerrit.wikimedia.org/r/789570 (https://phabricator.wikimedia.org/T123456) (owner: Slyngshede)
[08:25:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:25:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:25:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:18] <wikibugs>	 (CR) Marostegui: auto_schema: Make alter non-blocking on master of primary dc (1 comment) [software] - https://gerrit.wikimedia.org/r/791297 (owner: Ladsgroup)
[08:28:11] <wikibugs>	 (PS1) Cathal Mooney: Add policer config to swithes [homer/public] - https://gerrit.wikimedia.org/r/792567
[08:28:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:28:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:41] <wikibugs>	 (PS2) Muehlenhoff: Remove webperf1001/2001 from Scap config [puppet] - https://gerrit.wikimedia.org/r/791300
[08:30:05] <Amir1>	 jouncebot: nowandnext
[08:30:05] <jouncebot>	 For the next 1 hour(s) and 29 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220517T0800)
[08:30:05] <jouncebot>	 In 4 hour(s) and 29 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220517T1300)
[08:32:25] <wikibugs>	 (PS1) Ladsgroup: ContribsPager: Update index hint to use revision table in READ NEW [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/792474 (https://phabricator.wikimedia.org/T307295)
[08:32:39] <wikibugs>	 (PS1) Ladsgroup: ContribsPager: Update index hint to use revision table in READ NEW [core] (wmf/1.39.0-wmf.12) - https://gerrit.wikimedia.org/r/792475 (https://phabricator.wikimedia.org/T307295)
[08:32:47] <wikibugs>	 (CR) Ladsgroup: [C: +2] ContribsPager: Update index hint to use revision table in READ NEW [core] (wmf/1.39.0-wmf.12) - https://gerrit.wikimedia.org/r/792475 (https://phabricator.wikimedia.org/T307295) (owner: Ladsgroup)
[08:33:15] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove webperf1001/2001 from Scap config [puppet] - https://gerrit.wikimedia.org/r/791300 (owner: Muehlenhoff)
[08:33:37] <Amir1>	 jnuche: hi, I'm going to backport some stuff, are you done with the train?
[08:34:32] <jnuche>	 Amir1: hi, yeah, you can go aead
[08:34:36] <jnuche>	 *ahead
[08:34:48] <Amir1>	 Thanks!
[08:34:57] <wikibugs>	 (CR) Ladsgroup: [C: +2] ContribsPager: Update index hint to use revision table in READ NEW [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/792474 (https://phabricator.wikimedia.org/T307295) (owner: Ladsgroup)
[08:35:51] <wikibugs>	 (CR) Ladsgroup: [C: +2] Turn on read new for templatelinks on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/792565 (https://phabricator.wikimedia.org/T306673) (owner: Ladsgroup)
[08:37:25] <wikibugs>	 (Merged) jenkins-bot: Turn on read new for templatelinks on frwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/792565 (https://phabricator.wikimedia.org/T306673) (owner: Ladsgroup)
[08:38:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:38:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:31] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:792565|Turn on read new for templatelinks on frwiki (T306673)]] (duration: 02m 25s)
[08:40:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:36] <stashbot>	 T306673: Turn on read new for templatelinks on beta and production - https://phabricator.wikimedia.org/T306673
[08:43:54] <icinga-wm>	 PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:45:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[08:45:50] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[08:45:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 5%: After depooling', diff saved to https://phabricator.wikimedia.org/P27833 and previous config saved to /var/cache/conftool/dbconfig/20220517-084704-root.json
[08:47:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:51] <wikibugs>	 SRE, DBA, Wikimedia-Incident, Wikimedia-production-error: 2022-05-14 Databases - https://phabricator.wikimedia.org/T308380 (Marostegui) I have tweaked db1172's weight and I am slowly repooling it
[08:48:22] <icinga-wm>	 RECOVERY - Host an-tool1007 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[08:48:28] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.reimage for host ganeti4002.ulsfo.wmnet with OS bullseye
[08:48:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:36] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1001 for host ganeti4002.ulsfo.wmnet with OS bullseye
[08:49:40] <icinga-wm>	 PROBLEM - turnilo.wikimedia.org requires authentication on an-tool1007 is CRITICAL: connect to address 10.64.36.118 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[08:50:00] <icinga-wm>	 PROBLEM - turnilo.wikimedia.org tls expiry on an-tool1007 is CRITICAL: connect to address 10.64.36.118 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[08:50:00] <wikibugs>	 (CR) Slyngshede: [C: +2] Move Wiki Rsync fetch jobs to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/790967 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[08:50:06] <icinga-wm>	 ACKNOWLEDGEMENT - turnilo.wikimedia.org requires authentication on an-tool1007 is CRITICAL: connect to address 10.64.36.118 and port 443: Connection refused Btullis Working on the upgrade in T301990 https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[08:50:06] <icinga-wm>	 ACKNOWLEDGEMENT - turnilo.wikimedia.org tls expiry on an-tool1007 is CRITICAL: connect to address 10.64.36.118 and port 443: Connection refused Btullis Working on the upgrade in T301990 https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[08:51:16] <wikibugs>	 (PS2) Ladsgroup: mediabackup: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/792176 (https://phabricator.wikimedia.org/T308013)
[08:51:23] <wikibugs>	 (PS3) Ladsgroup: mediabackup: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/792176 (https://phabricator.wikimedia.org/T308013)
[08:52:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[08:52:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:30] <wikibugs>	 (Merged) jenkins-bot: ContribsPager: Update index hint to use revision table in READ NEW [core] (wmf/1.39.0-wmf.12) - https://gerrit.wikimedia.org/r/792475 (https://phabricator.wikimedia.org/T307295) (owner: Ladsgroup)
[08:52:36] <wikibugs>	 (Merged) jenkins-bot: ContribsPager: Update index hint to use revision table in READ NEW [core] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/792474 (https://phabricator.wikimedia.org/T307295) (owner: Ladsgroup)
[08:54:31] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.12/includes/specials/pagers/ContribsPager.php: Backport: [[gerrit:792475|ContribsPager: Update index hint to use revision table in READ NEW (T307295)]] (duration: 00m 56s)
[08:54:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:36] <stashbot>	 T307295: Bot contributions page in Catalan wikipedia not displayed - https://phabricator.wikimedia.org/T307295
[08:57:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[08:57:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:04] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.10/includes/specials/pagers/ContribsPager.php: Backport: [[gerrit:792474|ContribsPager: Update index hint to use revision table in READ NEW (T307295)]] (duration: 00m 53s)
[08:59:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:58] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] Don't schedule calico kube-controllers on master nodes [deployment-charts] - https://gerrit.wikimedia.org/r/777364 (owner: JMeybohm)
[09:01:40] <wikibugs>	 SRE, Infrastructure-Foundations, Prod-Kubernetes, netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (akosiaris) >>! In T306649#7933266, @JMeybohm wrote: >>>! In T306649#7931058, @akosiaris wrote: >>> Regarding the "fake nodes": I...
[09:02:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 10%: After depooling', diff saved to https://phabricator.wikimedia.org/P27834 and previous config saved to /var/cache/conftool/dbconfig/20220517-090208-root.json
[09:02:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:04:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:04:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:54] <icinga-wm>	 PROBLEM - puppet last run on an-tool1007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.36.118: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:04:58] <icinga-wm>	 PROBLEM - Check systemd state on an-tool1007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.36.118: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:05:14] <icinga-wm>	 PROBLEM - Check the NTP synchronisation status of timesyncd on an-tool1007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.36.118: Connection reset by peer https://wikitech.wikimedia.org/wiki/NTP
[09:05:22] <icinga-wm>	 PROBLEM - Check that envoy is running on an-tool1007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.36.118: Connection reset by peer https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[09:05:40] <icinga-wm>	 PROBLEM - DPKG on an-tool1007 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.36.118: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[09:05:56] <logmsgbot>	 !log jmm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4002.ulsfo.wmnet with reason: host reimage
[09:05:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:32] <icinga-wm>	 RECOVERY - Check systemd state on an-tool1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:07:00] <icinga-wm>	 RECOVERY - Check that envoy is running on an-tool1007 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy
[09:09:21] <wikibugs>	 (CR) Muehlenhoff: aptrepo: import gitlab package for bullseye (1 comment) [puppet] - https://gerrit.wikimedia.org/r/792108 (https://phabricator.wikimedia.org/T307142) (owner: Jelto)
[09:09:36] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4002.ulsfo.wmnet with reason: host reimage
[09:09:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:17] <icinga-wm>	 RECOVERY - puppet last run on an-tool1007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:10:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:10:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:51] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:11:05] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/792125 (https://phabricator.wikimedia.org/T304497) (owner: Volans)
[09:12:40] <wikibugs>	 (PS1) Filippo Giunchedi: mx: remove queue size alert, moved to Prometheus [puppet] - https://gerrit.wikimedia.org/r/792568 (https://phabricator.wikimedia.org/T305847)
[09:12:42] <wikibugs>	 (PS1) Filippo Giunchedi: fastnetmon: export notification count as metric [puppet] - https://gerrit.wikimedia.org/r/792569 (https://phabricator.wikimedia.org/T305847)
[09:13:20] <wikibugs>	 (CR) jerkins-bot: [V: -1] mx: remove queue size alert, moved to Prometheus [puppet] - https://gerrit.wikimedia.org/r/792568 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[09:13:39] <wikibugs>	 (PS1) Marostegui: Revert "db1164: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/792476
[09:13:48] <wikibugs>	 (CR) jerkins-bot: [V: -1] fastnetmon: export notification count as metric [puppet] - https://gerrit.wikimedia.org/r/792569 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[09:14:31] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Add new cloudsw to rancid for config backup [puppet] - https://gerrit.wikimedia.org/r/791600 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[09:14:34] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1164: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/792476 (owner: Marostegui)
[09:14:49] <wikibugs>	 (CR) Volans: [C: +2] "PCC happy: https://puppet-compiler.wmflabs.org/pcc-worker1001/35304/cumin1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/792125 (https://phabricator.wikimedia.org/T304497) (owner: Volans)
[09:15:05] <wikibugs>	 (PS3) Filippo Giunchedi: mediawiki: remove idle php-fpm workers alert, moved to prometheus/alertmanager [puppet] - https://gerrit.wikimedia.org/r/791360 (https://phabricator.wikimedia.org/T305847)
[09:15:07] <wikibugs>	 (PS2) Filippo Giunchedi: mx: remove queue size alert, moved to Prometheus [puppet] - https://gerrit.wikimedia.org/r/792568 (https://phabricator.wikimedia.org/T305847)
[09:15:09] <wikibugs>	 (PS2) Filippo Giunchedi: fastnetmon: export notification count as metric [puppet] - https://gerrit.wikimedia.org/r/792569 (https://phabricator.wikimedia.org/T305847)
[09:15:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:15:28] <wikibugs>	 (CR) Muehlenhoff: "I had no idea the same cron was also copied over to the other role. We can properly address this by reducing code duplication: If we creat" [puppet] - https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[09:15:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:11] <wikibugs>	 (CR) jerkins-bot: [V: -1] mx: remove queue size alert, moved to Prometheus [puppet] - https://gerrit.wikimedia.org/r/792568 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[09:16:14] <logmsgbot>	 !log btullis@deploy1002 Started deploy [analytics/turnilo/deploy@bf60521]: (no justification provided)
[09:16:17] <logmsgbot>	 !log btullis@deploy1002 Finished deploy [analytics/turnilo/deploy@bf60521]: (no justification provided) (duration: 00m 03s)
[09:16:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:16:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:16:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 25%: After depooling', diff saved to https://phabricator.wikimedia.org/P27835 and previous config saved to /var/cache/conftool/dbconfig/20220517-091712-root.json
[09:17:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:43] <wikibugs>	 (CR) Ayounsi: [C: +2] switches: use get_junos_interfaces [homer/public] - https://gerrit.wikimedia.org/r/792566 (owner: Ayounsi)
[09:19:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:19:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:59] <icinga-wm>	 RECOVERY - turnilo.wikimedia.org requires authentication on an-tool1007 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 546 bytes in 1.006 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[09:20:20] <wikibugs>	 (Merged) jenkins-bot: switches: use get_junos_interfaces [homer/public] - https://gerrit.wikimedia.org/r/792566 (owner: Ayounsi)
[09:20:54] <XioNoX>	 !log all switches, split configuration per interfaces (use new get_junos_interfaces function)
[09:20:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:05] <icinga-wm>	 RECOVERY - Disk space on ms-be1040 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1040&var-datasource=eqiad+prometheus/ops
[09:22:35] <icinga-wm>	 RECOVERY - turnilo.wikimedia.org tls expiry on an-tool1007 is OK: OK - Certificate yarn.wikimedia.org will expire on Sat 01 May 2027 07:37:58 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[09:24:13] <wikibugs>	 (CR) Elukey: [C: +1] Don't schedule calico kube-controllers on master nodes [deployment-charts] - https://gerrit.wikimedia.org/r/777364 (owner: JMeybohm)
[09:25:10] <logmsgbot>	 !log jmm@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4002.ulsfo.wmnet with OS bullseye
[09:25:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:16] <wikibugs>	 SRE, Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1001 for host ganeti4002.ulsfo.wmnet with OS bullseye completed: - ganeti4002 (**PASS**)   - Downtimed on Icinga/Aler...
[09:26:58] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] Move Carbon Cache log cleanup to systemd tmpfile. [puppet] - https://gerrit.wikimedia.org/r/792155 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[09:32:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 50%: After depooling', diff saved to https://phabricator.wikimedia.org/P27836 and previous config saved to /var/cache/conftool/dbconfig/20220517-093216-root.json
[09:32:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:37] <wikibugs>	 (PS1) Ayounsi: switch interfaces: sort vlans [homer/public] - https://gerrit.wikimedia.org/r/792571
[09:36:25] <icinga-wm>	 RECOVERY - Check the NTP synchronisation status of timesyncd on an-tool1007 is OK: OK: synced at Tue 2022-05-17 09:36:24 UTC. https://wikitech.wikimedia.org/wiki/NTP
[09:36:46] <wikibugs>	 (CR) Ayounsi: [C: +2] switch interfaces: sort vlans [homer/public] - https://gerrit.wikimedia.org/r/792571 (owner: Ayounsi)
[09:36:53] <icinga-wm>	 RECOVERY - DPKG on an-tool1007 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[09:37:20] <wikibugs>	 (Merged) jenkins-bot: switch interfaces: sort vlans [homer/public] - https://gerrit.wikimedia.org/r/792571 (owner: Ayounsi)
[09:39:49] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] Reduce the scope of Calico's global BGP Peers for ml-serve-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/792232 (https://phabricator.wikimedia.org/T308418) (owner: Elukey)
[09:44:57] <icinga-wm>	 PROBLEM - Check systemd state on an-tool1007 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:45:05] <icinga-wm>	 RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:47:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 75%: After depooling', diff saved to https://phabricator.wikimedia.org/P27837 and previous config saved to /var/cache/conftool/dbconfig/20220517-094719-root.json
[09:47:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:55] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:54:00] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:55:32] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "Thanks for the patch, very much appreciated, but i wonder if this is the file you had intended to patch.  software/puppet-compiler is the " [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/791459 (owner: 10BCornwall)
[10:00:25] <wikibugs>	 SRE, Infrastructure-Foundations: puppetmaster1001 disk warning on / - https://phabricator.wikimedia.org/T304898 (Marostegui) Open→Resolved a:MoritzMuehlenhoff @MoritzMuehlenhoff dropped a bunch of `/tmp/tmp.*` and the disk is back to 64%: ` root@puppetmaster1001:/var/log/apache2# df -hT / Fil...
[10:02:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1172 (re)pooling @ 100%: After depooling', diff saved to https://phabricator.wikimedia.org/P27838 and previous config saved to /var/cache/conftool/dbconfig/20220517-100223-root.json
[10:02:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:45] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/792253 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[10:03:02] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] C:helm:  make the group permissions on helm_cache configurable [puppet] - 10https://gerrit.wikimedia.org/r/791565 (https://phabricator.wikimedia.org/T305729) (owner: 10Jbond)
[10:04:53] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] codesearch: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792284 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[10:05:00] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/792284 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[10:05:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] libraryupgrader: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792253 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[10:09:08] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/775904 (owner: 10Volans)
[10:09:57] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM thanks" [puppet] - 10https://gerrit.wikimedia.org/r/792176 (https://phabricator.wikimedia.org/T308013) (owner: 10Ladsgroup)
[10:15:54] <wikibugs>	 (03PS1) 10Jbond: admin - gitlab-roots: add *contint_roots_members to gitlab-roots [puppet] - 10https://gerrit.wikimedia.org/r/792576 (https://phabricator.wikimedia.org/T308350)
[10:16:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4002.ulsfo.wmnet
[10:16:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:45] <wikibugs>	 (03PS1) 10JMeybohm: Add a rake task to generate JSON schema for chart CRDs on the fly [deployment-charts] - 10https://gerrit.wikimedia.org/r/792577 (https://phabricator.wikimedia.org/T306165)
[10:20:13] <wikibugs>	 (03PS3) 10Filippo Giunchedi: fastnetmon: export notification count as metric [puppet] - 10https://gerrit.wikimedia.org/r/792569 (https://phabricator.wikimedia.org/T305847)
[10:21:05] <wikibugs>	 (03PS2) 10JMeybohm: Add a rake task to generate JSON schema for chart CRDs on the fly [deployment-charts] - 10https://gerrit.wikimedia.org/r/792577 (https://phabricator.wikimedia.org/T306165)
[10:22:53] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:24:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4002.ulsfo.wmnet
[10:24:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:57] <wikibugs>	 (03PS3) 10JMeybohm: Add a rake task to generate JSON schema for chart CRDs on the fly [deployment-charts] - 10https://gerrit.wikimedia.org/r/792577 (https://phabricator.wikimedia.org/T306165)
[10:29:19] <wikibugs>	 (CR) Cathal Mooney: Add new subnets for cloudsw expansion Eqiad to netops infrastructure (2 comments) [puppet] - https://gerrit.wikimedia.org/r/791585 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[10:31:03] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] mediabackup: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792176 (https://phabricator.wikimedia.org/T308013) (owner: 10Ladsgroup)
[10:32:01] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35306/console" [puppet] - 10https://gerrit.wikimedia.org/r/792569 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[10:32:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4002.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet
[10:32:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:27] <wikibugs>	 (03PS4) 10Jbond: sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe [cookbooks] - 10https://gerrit.wikimedia.org/r/792251
[10:32:29] <wikibugs>	 (03PS3) 10Hnowlan: Set production role and add config for restbase2027 [puppet] - 10https://gerrit.wikimedia.org/r/779846
[10:32:50] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti4002.ulsfo.wmnet to ganeti01.svc.ulsfo.wmnet
[10:32:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sre.host.pxe: Cookbook to configure dhcp option82 and reboot into pxe [cookbooks] - 10https://gerrit.wikimedia.org/r/792251 (owner: 10Jbond)
[10:36:19] <icinga-wm>	 PROBLEM - Disk space on ms-be1040 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdl1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1040&var-datasource=eqiad+prometheus/ops
[10:41:04] <wikibugs>	 (03PS3) 10Jbond: dhcp: DHCPConfOpt82 media_type parameter [software/spicerack] - 10https://gerrit.wikimedia.org/r/792238
[10:41:09] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "updated thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/792238 (owner: 10Jbond)
[10:41:59] <wikibugs>	 (03PS1) 10Muehlenhoff: Add SPDX headers for routinator/diffscan/bgpalerter/gobgpd/homer [puppet] - 10https://gerrit.wikimedia.org/r/792579 (https://phabricator.wikimedia.org/T308013)
[10:50:09] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: New version [software/conftool] - 10https://gerrit.wikimedia.org/r/792556
[10:50:11] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: requestctl: always add header for detection [software/conftool] - 10https://gerrit.wikimedia.org/r/792580
[10:50:13] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: requestctl: do not ask for confirmation for emtpy changes [software/conftool] - 10https://gerrit.wikimedia.org/r/792581
[10:55:34] <wikibugs>	 (03PS4) 10Ladsgroup: mediabackup: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792176 (https://phabricator.wikimedia.org/T308013)
[10:55:40] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] mediabackup: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792176 (https://phabricator.wikimedia.org/T308013) (owner: 10Ladsgroup)
[10:58:53] <wikibugs>	 (03PS2) 10Slyngshede: Move restart of slapd, due to memory leaks, to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673)
[10:59:13] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:59:29] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Move restart of slapd, due to memory leaks, to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[11:00:58] <wikibugs>	 (03PS3) 10Slyngshede: Move restart of slapd, due to memory leaks, to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673)
[11:01:35] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Move restart of slapd, due to memory leaks, to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[11:01:57] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:02:47] <wikibugs>	 (03PS4) 10Slyngshede: Move restart of slapd, due to memory leaks, to systemd timers. [puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673)
[11:07:35] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35307/console" [puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[11:09:47] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35308/console" [puppet] - 10https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[11:14:46] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/792579 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[11:16:12] <wikibugs>	 (CR) Slyngshede: Move restart of slapd, due to memory leaks, to systemd timers. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[11:22:46] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[11:44:57] <wikibugs>	 (CR) Slyngshede: [V: +1] Move rabbitmq to systemd timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/791367 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[11:49:17] <wikibugs>	 (03PS3) 10Filippo Giunchedi: mx: remove queue size alert, moved to Prometheus [puppet] - 10https://gerrit.wikimedia.org/r/792568 (https://phabricator.wikimedia.org/T305847)
[11:49:19] <wikibugs>	 (03PS4) 10Filippo Giunchedi: fastnetmon: export notification count as metric [puppet] - 10https://gerrit.wikimedia.org/r/792569 (https://phabricator.wikimedia.org/T305847)
[11:51:43] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:53:51] <moritzm>	 !log failover Ganeti master in ulsfo to ganeti4001 T307997
[11:53:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:58] <stashbot>	 T307997: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997
[11:55:07] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:57:57] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti4003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[11:58:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/792569 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[12:00:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] fastnetmon: export notification count as metric [puppet] - 10https://gerrit.wikimedia.org/r/792569 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[12:00:15] <wikibugs>	 (03PS5) 10Filippo Giunchedi: fastnetmon: export notification count as metric [puppet] - 10https://gerrit.wikimedia.org/r/792569 (https://phabricator.wikimedia.org/T305847)
[12:00:25] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:02:40] <wikibugs>	 (03PS1) 10Jbond: redfish: add support to upload files via the request method [software/spicerack] - 10https://gerrit.wikimedia.org/r/792595
[12:04:16] <moritzm>	 !log draining ganeti4003 T307997
[12:04:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:04:21] <stashbot>	 T307997: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997
[12:14:11] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10awight)
[12:19:50] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[12:19:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[12:19:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:57] <logmsgbot>	 !log hnowlan@deploy1002 Started deploy [restbase/deploy@6e39559]: Add kcgwiki - T305281
[12:19:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:04] <stashbot>	 T305281: Post-creation work for kcgwiki - https://phabricator.wikimedia.org/T305281
[12:21:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[12:21:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[12:21:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T303603)', diff saved to https://phabricator.wikimedia.org/P27840 and previous config saved to /var/cache/conftool/dbconfig/20220517-122201-ladsgroup.json
[12:22:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:14] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[12:25:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T303603)', diff saved to https://phabricator.wikimedia.org/P27841 and previous config saved to /var/cache/conftool/dbconfig/20220517-122517-ladsgroup.json
[12:25:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:31] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] fix_logging.log_timestamp_type_T298555.py: New schema change. [software/schema-changes] - 10https://gerrit.wikimedia.org/r/788291 (https://phabricator.wikimedia.org/T298555) (owner: 10Kormat)
[12:26:01] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] fix_revision.rev_timestamp_type_T298560.py: New schema change. [software/schema-changes] - 10https://gerrit.wikimedia.org/r/788290 (https://phabricator.wikimedia.org/T298560) (owner: 10Kormat)
[12:27:34] <wikibugs>	 (03Merged) 10jenkins-bot: fix_logging.log_timestamp_type_T298555.py: New schema change. [software/schema-changes] - 10https://gerrit.wikimedia.org/r/788291 (https://phabricator.wikimedia.org/T298555) (owner: 10Kormat)
[12:27:38] <wikibugs>	 (03Merged) 10jenkins-bot: fix_revision.rev_timestamp_type_T298560.py: New schema change. [software/schema-changes] - 10https://gerrit.wikimedia.org/r/788290 (https://phabricator.wikimedia.org/T298560) (owner: 10Kormat)
[12:36:44] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, feel free to merge as is. Couple of questions inline." [software/spicerack] - 10https://gerrit.wikimedia.org/r/792595 (owner: 10Jbond)
[12:39:49] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] Move Hadoop eventlogs cleanup to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792116 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[12:39:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[12:39:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[12:39:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:40:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27842 and previous config saved to /var/cache/conftool/dbconfig/20220517-124022-ladsgroup.json
[12:40:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:09] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The following units failed: swift-object.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:42:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[12:42:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[12:42:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[12:42:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:22] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[12:42:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T298560)', diff saved to https://phabricator.wikimedia.org/P27843 and previous config saved to /var/cache/conftool/dbconfig/20220517-124227-ladsgroup.json
[12:42:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:34] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[12:44:47] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:47:15] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/792238 (owner: 10Jbond)
[12:55:04] <wikibugs>	 (03CR) 10Kosta Harlan: "should we consider adding the messages on wiki via MediaWiki namespace, as syncing the i18n updates is time-consuming?" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792478 (https://phabricator.wikimedia.org/T305659) (owner: 10Gergő Tisza)
[12:55:21] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] requestctl: fix small issues with the VSL translations [software/conftool] - 10https://gerrit.wikimedia.org/r/792555 (owner: 10Giuseppe Lavagetto)
[12:55:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P27844 and previous config saved to /var/cache/conftool/dbconfig/20220517-125527-ladsgroup.json
[12:55:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:33] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Add SPDX headers for routinator/diffscan/bgpalerter/gobgpd/homer [puppet] - 10https://gerrit.wikimedia.org/r/792579 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff)
[12:55:37] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] requestctl: always add header for detection [software/conftool] - 10https://gerrit.wikimedia.org/r/792580 (owner: 10Giuseppe Lavagetto)
[12:56:09] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] requestctl: do not ask for confirmation for emtpy changes [software/conftool] - 10https://gerrit.wikimedia.org/r/792581 (owner: 10Giuseppe Lavagetto)
[12:56:27] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] "please fix then lgtm" [software/conftool] - 10https://gerrit.wikimedia.org/r/792556 (owner: 10Giuseppe Lavagetto)
[12:57:00] <wikibugs>	 (03PS1) 10Zabe: wmfmariadbpy: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792607 (https://phabricator.wikimedia.org/T308013)
[12:57:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[12:57:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[12:57:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T300774)', diff saved to https://phabricator.wikimedia.org/P27845 and previous config saved to /var/cache/conftool/dbconfig/20220517-125713-ladsgroup.json
[12:57:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:18] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[12:57:19] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Add policer config to swithes [homer/public] - 10https://gerrit.wikimedia.org/r/792567 (owner: 10Cathal Mooney)
[12:58:11] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Reduce the scope of Calico's global BGP Peers for ml-serve-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/792232 (https://phabricator.wikimedia.org/T308418) (owner: 10Elukey)
[12:58:34] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "I did a manual run of backups for both clients. They got the following amount of data:" [puppet] - 10https://gerrit.wikimedia.org/r/792125 (https://phabricator.wikimedia.org/T304497) (owner: 10Volans)
[12:58:47] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/787831 (owner: 10PipelineBot)
[12:59:12] <wikibugs>	 (03PS1) 10Zabe: wikilabels: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792608 (https://phabricator.wikimedia.org/T308013)
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC afternoon backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220517T1300).
[13:00:04] <jouncebot>	 tgr: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:01:10] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[13:01:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:27] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[13:01:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:45] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:02:15] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv4: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad, AS64606/IPv6: Active - kubernetes-ml-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:02:41] <tgr>	 I'll deploy
[13:02:51] <Amir1>	 !log killed cawiki's refreshLinkRecommendations.php (T299021)
[13:02:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:57] <stashbot>	 T299021: Shorten running time of refreshLinkRecommendations.php - https://phabricator.wikimedia.org/T299021
[13:03:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T300774)', diff saved to https://phabricator.wikimedia.org/P27846 and previous config saved to /var/cache/conftool/dbconfig/20220517-130322-ladsgroup.json
[13:03:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:03:28] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[13:03:53] <wikibugs>	 (03Merged) 10jenkins-bot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/787831 (owner: 10PipelineBot)
[13:04:08] <wikibugs>	 (03PS1) 10Zabe: visualdiff: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792609 (https://phabricator.wikimedia.org/T308013)
[13:05:37] <wikibugs>	 (CR) Gergő Tisza: Account creation: add Thank you banner texts (1 comment) [extensions/GrowthExperiments] (wmf/1.39.0-wmf.12) - https://gerrit.wikimedia.org/r/792478 (https://phabricator.wikimedia.org/T305659) (owner: Gergő Tisza)
[13:05:42] <wikibugs>	 (03CR) 10Gergő Tisza: [C: 03+2] Account creation: add Thank you banner texts [extensions/GrowthExperiments] (wmf/1.39.0-wmf.12) - 10https://gerrit.wikimedia.org/r/792478 (https://phabricator.wikimedia.org/T305659) (owner: 10Gergő Tisza)
[13:09:33] <wikibugs>	 (CR) Volans: [C: +2] cluster::management: backup auditing logs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/792125 (https://phabricator.wikimedia.org/T304497) (owner: Volans)
[13:10:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T303603)', diff saved to https://phabricator.wikimedia.org/P27848 and previous config saved to /var/cache/conftool/dbconfig/20220517-131032-ladsgroup.json
[13:10:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[13:10:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance
[13:10:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:38] <wikibugs>	 (03PS1) 10Elukey: Allow BGP from calico pods running on master nodes on ml-serve-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/792611 (https://phabricator.wikimedia.org/T308418)
[13:10:39] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[13:10:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T303603)', diff saved to https://phabricator.wikimedia.org/P27849 and previous config saved to /var/cache/conftool/dbconfig/20220517-131040-ladsgroup.json
[13:10:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:10:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:58] <elukey>	 '12
[13:12:01] <elukey>	 uff
[13:14:19] <wikibugs>	 (03PS2) 10Jbond: redfish: update signature of requests method with kwargs [software/spicerack] - 10https://gerrit.wikimedia.org/r/792595
[13:14:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T303603)', diff saved to https://phabricator.wikimedia.org/P27850 and previous config saved to /var/cache/conftool/dbconfig/20220517-131453-ladsgroup.json
[13:14:58] <wikibugs>	 (03CR) 10Elukey: "The alternative could be to just remove BGP session configuration from the homer public repository, but it may be confusing if we'll want " [deployment-charts] - 10https://gerrit.wikimedia.org/r/792611 (https://phabricator.wikimedia.org/T308418) (owner: 10Elukey)
[13:14:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:10] <wikibugs>	 (03CR) 10Jbond: "thanks updated to pass kwargs as per irc discussion" [software/spicerack] - 10https://gerrit.wikimedia.org/r/792595 (owner: 10Jbond)
[13:16:10] <wikibugs>	 (CR) BCornwall: cli: Add support for XDG Base Directory spec (3 comments) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/791459 (owner: BCornwall)
[13:18:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P27851 and previous config saved to /var/cache/conftool/dbconfig/20220517-131827-ladsgroup.json
[13:18:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:21:05] <wikibugs>	 (03PS1) 10Andrew Bogott: icinga: remove creds for a couple of departed WMCS SREs [puppet] - 10https://gerrit.wikimedia.org/r/792612
[13:21:07] <wikibugs>	 (03PS1) 10Andrew Bogott: icinga: Added Camel Case version of my name as authorized user [puppet] - 10https://gerrit.wikimedia.org/r/792613 (https://phabricator.wikimedia.org/T275920)
[13:23:09] <wikibugs>	 10SRE, 10Generated Data Platform, 10Image-Suggestions, 10serviceops, and 2 others: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (10hnowlan)
[13:24:14] <wikibugs>	 10SRE-OnFire, 10SRE Observability (FY2021/2022-Q4): implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10lmata)
[13:26:19] <wikibugs>	 (03CR) 10Volans: "In light of IRC chats and the previous comments, I did a full pass and make some suggestions on how to align this more with the SREBatchBa" [cookbooks] - 10https://gerrit.wikimedia.org/r/789680 (https://phabricator.wikimedia.org/T260661) (owner: 10JMeybohm)
[13:29:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P27852 and previous config saved to /var/cache/conftool/dbconfig/20220517-132958-ladsgroup.json
[13:30:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:55] <wikibugs>	 (CR) Hnowlan: New service: image-suggestion (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) (owner: Hnowlan)
[13:33:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P27853 and previous config saved to /var/cache/conftool/dbconfig/20220517-133333-ladsgroup.json
[13:33:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:55] <tgr>	 somehow stuck in "ready to submit". Guess I'll have to force merge.
[13:35:18] <wikibugs>	 (PS4) Jbond: dhcp: DHCPConfOpt82 and DHCPConfMac media_type parameter [software/spicerack] - https://gerrit.wikimedia.org/r/792238
[13:35:59] <wikibugs>	 SRE, ops-codfw, DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (Papaul) @Dzahn are you planning on re-imaging the server after the move so I know what approach to take for the IP change?
[13:36:08] <zabe>	 the gate-and-submit pipeline was still running, it was ready to submit due to the V+2 from the main test build
[13:39:13] <tgr>	 I see. It's an i18n-only patch so the tests wouldn't have much use anyway.
[13:39:23] <wikibugs>	 (CR) Volans: "Nit inline" [software/spicerack] - https://gerrit.wikimedia.org/r/792595 (owner: Jbond)
[13:39:35] <wikibugs>	 (CR) JMeybohm: [C: +1] Allow BGP from calico pods running on master nodes on ml-serve-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/792611 (https://phabricator.wikimedia.org/T308418) (owner: Elukey)
[13:39:57] <wikibugs>	 (CR) JMeybohm: [C: +2] Don't schedule calico kube-controllers on master nodes [deployment-charts] - https://gerrit.wikimedia.org/r/777364 (owner: JMeybohm)
[13:40:13] <logmsgbot>	 !log tgr@deploy1002 Started scap: Backport with i18n changes: [[gerrit:792478|Account creation: add Thank you banner texts]]
[13:40:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:35] <wikibugs>	 (CR) Jbond: [C: -1] cli: Add support for XDG Base Directory spec (1 comment) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/791459 (owner: BCornwall)
[13:43:18] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] icinga: remove creds for a couple of departed WMCS SREs [puppet] - https://gerrit.wikimedia.org/r/792612 (owner: Andrew Bogott)
[13:43:40] <wikibugs>	 SRE, Icinga, Observability-Alerting, observability, Patch-For-Review: icinga login case mismatch - https://phabricator.wikimedia.org/T275920 (Andrew) One proposal (which may or may not be possible) would be to standardize on all-lowercase logins in icinga config, and then have our login front...
[13:44:26] <wikibugs>	 (Merged) jenkins-bot: Don't schedule calico kube-controllers on master nodes [deployment-charts] - https://gerrit.wikimedia.org/r/777364 (owner: JMeybohm)
[13:45:00] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] icinga: Added Camel Case version of my name as authorized user [puppet] - https://gerrit.wikimedia.org/r/792613 (https://phabricator.wikimedia.org/T275920) (owner: Andrew Bogott)
[13:45:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P27854 and previous config saved to /var/cache/conftool/dbconfig/20220517-134503-ladsgroup.json
[13:45:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:45:59] <wikibugs>	 (PS3) Jbond: redfish: update signature of requests method with kwargs [software/spicerack] - https://gerrit.wikimedia.org/r/792595
[13:46:01] <wikibugs>	 (CR) Jbond: "updated thanks" [software/spicerack] - https://gerrit.wikimedia.org/r/792595 (owner: Jbond)
[13:46:21] <wikibugs>	 (CR) Jbond: [C: +2] dhcp: DHCPConfOpt82 and DHCPConfMac media_type parameter [software/spicerack] - https://gerrit.wikimedia.org/r/792238 (owner: Jbond)
[13:46:33] <wikibugs>	 (CR) Elukey: [C: +2] Allow BGP from calico pods running on master nodes on ml-serve-eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/792611 (https://phabricator.wikimedia.org/T308418) (owner: Elukey)
[13:46:56] <wikibugs>	 SRE, Icinga, Observability-Alerting, observability, Patch-For-Review: icinga login case mismatch - https://phabricator.wikimedia.org/T275920 (fgiunchedi) I'm ok to stick with capitalized names since that's the convention and AFAICT the default / expected format.
[13:47:25] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/792595 (owner: Jbond)
[13:48:05] <wikibugs>	 (CR) Andrew Bogott: [C: +2] icinga: remove creds for a couple of departed WMCS SREs [puppet] - https://gerrit.wikimedia.org/r/792612 (owner: Andrew Bogott)
[13:48:14] <wikibugs>	 (CR) Andrew Bogott: [C: +2] icinga: Added Camel Case version of my name as authorized user [puppet] - https://gerrit.wikimedia.org/r/792613 (https://phabricator.wikimedia.org/T275920) (owner: Andrew Bogott)
[13:48:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T300774)', diff saved to https://phabricator.wikimedia.org/P27855 and previous config saved to /var/cache/conftool/dbconfig/20220517-134838-ladsgroup.json
[13:48:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:43] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[13:49:59] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[13:50:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[13:50:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[13:50:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:04] <wikibugs>	 (Abandoned) Thiemo Kreuz (WMDE): Duplicate "latest revision may be special" logic from FlaggedRevs [extensions/Kartographer] (wmf/1.39.0-wmf.10) - https://gerrit.wikimedia.org/r/791248 (https://phabricator.wikimedia.org/T304813) (owner: Awight)
[13:50:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T300774)', diff saved to https://phabricator.wikimedia.org/P27856 and previous config saved to /var/cache/conftool/dbconfig/20220517-135006-ladsgroup.json
[13:50:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:55] <wikibugs>	 (CR) Jbond: [C: +2] admin - gitlab-roots: add *contint_roots_members to gitlab-roots [puppet] - https://gerrit.wikimedia.org/r/792576 (https://phabricator.wikimedia.org/T308350) (owner: Jbond)
[13:52:18] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[13:52:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:20] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] requestctl: fix small issues with the VSL translations [software/conftool] - https://gerrit.wikimedia.org/r/792555 (owner: Giuseppe Lavagetto)
[13:54:00] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:54:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T300774)', diff saved to https://phabricator.wikimedia.org/P27857 and previous config saved to /var/cache/conftool/dbconfig/20220517-135401-ladsgroup.json
[13:54:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:07] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[13:54:45] <wikibugs>	 (Merged) jenkins-bot: dhcp: DHCPConfOpt82 and DHCPConfMac media_type parameter [software/spicerack] - https://gerrit.wikimedia.org/r/792238 (owner: Jbond)
[13:55:01] <wikibugs>	 (PS1) Majavah: openstack: Make enc api enforce keystone policy [puppet] - https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666)
[13:55:11] <logmsgbot>	 !log tgr@deploy1002 Finished scap: Backport with i18n changes: [[gerrit:792478|Account creation: add Thank you banner texts]] (duration: 14m 57s)
[13:55:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:55:53] <wikibugs>	 (Merged) jenkins-bot: requestctl: fix small issues with the VSL translations [software/conftool] - https://gerrit.wikimedia.org/r/792555 (owner: Giuseppe Lavagetto)
[13:55:59] <wikibugs>	 (CR) jerkins-bot: [V: -1] openstack: Make enc api enforce keystone policy [puppet] - https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666) (owner: Majavah)
[13:56:51] <wikibugs>	 SRE, ops-ulsfo, DC-Ops, Traffic: ganeti4002 dimm error - https://phabricator.wikimedia.org/T303318 (RobH) a:RobH→MoritzMuehlenhoff @MoritzMuehlenhoff,  Can we plan to have ganeti4002 drained of activity for me on Thursday, May 19th, so I can swap out the defective memory stick?
[13:56:57] <wikibugs>	 (PS2) Majavah: openstack: Make enc api enforce keystone policy [puppet] - https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666)
[13:58:12] <wikibugs>	 (CR) jerkins-bot: [V: -1] openstack: Make enc api enforce keystone policy [puppet] - https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666) (owner: Majavah)
[13:58:38] <wikibugs>	 (PS3) Majavah: openstack: Make enc api enforce keystone policy [puppet] - https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666)
[13:59:23] <wikibugs>	 SRE, Infrastructure-Foundations, Prod-Kubernetes, netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (elukey) I merged two changes for the ml-serve-eqiad cluster, and now the concerns expressed in T306649#7881940 should be gone:...
[13:59:53] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] requestctl: always add header for detection [software/conftool] - https://gerrit.wikimedia.org/r/792580 (owner: Giuseppe Lavagetto)
[14:00:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T303603)', diff saved to https://phabricator.wikimedia.org/P27858 and previous config saved to /var/cache/conftool/dbconfig/20220517-140008-ladsgroup.json
[14:00:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[14:00:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[14:00:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:15] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[14:00:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T303603)', diff saved to https://phabricator.wikimedia.org/P27859 and previous config saved to /var/cache/conftool/dbconfig/20220517-140016-ladsgroup.json
[14:00:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:00:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:02:42] <wikibugs>	 (Merged) jenkins-bot: requestctl: always add header for detection [software/conftool] - https://gerrit.wikimedia.org/r/792580 (owner: Giuseppe Lavagetto)
[14:04:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T303603)', diff saved to https://phabricator.wikimedia.org/P27860 and previous config saved to /var/cache/conftool/dbconfig/20220517-140431-ladsgroup.json
[14:04:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:36] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: apply
[14:05:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:10] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply
[14:06:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:06] <wikibugs>	 (CR) Majavah: "tested on codfw1dev" [puppet] - https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666) (owner: Majavah)
[14:07:10] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply
[14:07:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[14:08:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:04] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply
[14:08:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:12] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply
[14:08:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:29] <wikibugs>	 (PS4) Majavah: openstack: Make enc api enforce keystone policy [puppet] - https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666)
[14:09:06] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] requestctl: do not ask for confirmation for emtpy changes [software/conftool] - https://gerrit.wikimedia.org/r/792581 (owner: Giuseppe Lavagetto)
[14:09:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27861 and previous config saved to /var/cache/conftool/dbconfig/20220517-140906-ladsgroup.json
[14:09:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:24] <wikibugs>	 (PS1) Ayounsi: wmf-netbox: remove deprecated functions [software/homer/deploy] - https://gerrit.wikimedia.org/r/792621
[14:10:15] <wikibugs>	 (CR) Volans: "It does include also a change in the description, is that wanted?" [software/homer/deploy] - https://gerrit.wikimedia.org/r/792621 (owner: Ayounsi)
[14:11:15] <wikibugs>	 (Merged) jenkins-bot: requestctl: do not ask for confirmation for emtpy changes [software/conftool] - https://gerrit.wikimedia.org/r/792581 (owner: Giuseppe Lavagetto)
[14:11:45] <wikibugs>	 SRE, SRE-Access-Requests, Machine-Learning-Team (Active Tasks): Add Aiko and Kevin to the deployment posix group - https://phabricator.wikimedia.org/T308308 (elukey)
[14:12:30] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply
[14:12:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:35] <wikibugs>	 (PS2) Ayounsi: wmf-netbox: remove deprecated functions [software/homer/deploy] - https://gerrit.wikimedia.org/r/792621
[14:14:04] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] New version [software/conftool] - https://gerrit.wikimedia.org/r/792556 (owner: Giuseppe Lavagetto)
[14:14:44] <wikibugs>	 (PS3) Ayounsi: wmf-netbox: remove deprecated functions [software/homer/deploy] - https://gerrit.wikimedia.org/r/792621
[14:14:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:14:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:14:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:56] <wikibugs>	 (PS3) Giuseppe Lavagetto: New version [software/conftool] - https://gerrit.wikimedia.org/r/792556
[14:16:01] <wikibugs>	 (Abandoned) Ayounsi: wmf-netbox: remove deprecated functions [software/homer/deploy] - https://gerrit.wikimedia.org/r/792621 (owner: Ayounsi)
[14:17:25] <wikibugs>	 (CR) Giuseppe Lavagetto: New version (1 comment) [software/conftool] - https://gerrit.wikimedia.org/r/792556 (owner: Giuseppe Lavagetto)
[14:18:18] <wikibugs>	 (PS1) Ayounsi: wmf-netbox: remove deprecated functions [software/homer/deploy] - https://gerrit.wikimedia.org/r/792622
[14:19:28] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] New version (1 comment) [software/conftool] - https://gerrit.wikimedia.org/r/792556 (owner: Giuseppe Lavagetto)
[14:19:31] <logmsgbot>	 !log hnowlan@deploy1002 Finished deploy [restbase/deploy@6e39559]: Add kcgwiki - T305281 (duration: 119m 34s)
[14:19:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P27862 and previous config saved to /var/cache/conftool/dbconfig/20220517-141936-ladsgroup.json
[14:19:36] <stashbot>	 T305281: Post-creation work for kcgwiki - https://phabricator.wikimedia.org/T305281
[14:19:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:59] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "This seems straightforward and good!" [puppet] - https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666) (owner: Majavah)
[14:20:40] <wikibugs>	 (CR) Volans: [C: +1] "LGTM if the templates have been all updated" [software/homer/deploy] - https://gerrit.wikimedia.org/r/792622 (owner: Ayounsi)
[14:21:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:21:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P27863 and previous config saved to /var/cache/conftool/dbconfig/20220517-142411-ladsgroup.json
[14:24:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:00] <wikibugs>	 (PS3) Hashar: Json schema from Gerrit Java event classes [software/gerrit/jsonschemagenerator] - https://gerrit.wikimedia.org/r/791642 (https://phabricator.wikimedia.org/T304947)
[14:25:29] <wikibugs>	 (CR) jerkins-bot: [V: -1] Json schema from Gerrit Java event classes [software/gerrit/jsonschemagenerator] - https://gerrit.wikimedia.org/r/791642 (https://phabricator.wikimedia.org/T304947) (owner: Hashar)
[14:25:37] <wikibugs>	 (PS1) Cathal Mooney: VRF element additions for cloudsw extention to row E/F [homer/public] - https://gerrit.wikimedia.org/r/792624 (https://phabricator.wikimedia.org/T304989)
[14:26:17] <wikibugs>	 (CR) jerkins-bot: [V: -1] VRF element additions for cloudsw extention to row E/F [homer/public] - https://gerrit.wikimedia.org/r/792624 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[14:28:28] <wikibugs>	 SRE, SRE-Access-Requests, Machine-Learning-Team (Active Tasks): Requesting access to the deployment POSIX group for aikochou and kevinbazira - https://phabricator.wikimedia.org/T308308 (elukey)
[14:29:26] <wikibugs>	 (CR) BCornwall: cli: Add support for XDG Base Directory spec (1 comment) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/791459 (owner: BCornwall)
[14:30:20] <wikibugs>	 SRE, Infrastructure-Foundations, Prod-Kubernetes, netops: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (akosiaris) >>! In T306649#7934722, @elukey wrote: > I merged two changes for the ml-serve-eqiad cluster, and now the concerns ex...
[14:30:45] <wikibugs>	 (PS11) Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - https://gerrit.wikimedia.org/r/787067
[14:32:46] <wikibugs>	 (CR) Ayounsi: [C: +2] Interface automation: fail on duplicate cable ID [software/netbox-extras] - https://gerrit.wikimedia.org/r/789089 (owner: Ayounsi)
[14:33:24] <wikibugs>	 (CR) jerkins-bot: [V: -1] prometheus::blackbox::check: add new blackbox exporter check [puppet] - https://gerrit.wikimedia.org/r/787067 (owner: Jbond)
[14:33:35] <wikibugs>	 (CR) Majavah: openstack: Make enc api enforce keystone policy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666) (owner: Majavah)
[14:33:38] <wikibugs>	 (Merged) jenkins-bot: Interface automation: fail on duplicate cable ID [software/netbox-extras] - https://gerrit.wikimedia.org/r/789089 (owner: Ayounsi)
[14:34:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P27864 and previous config saved to /var/cache/conftool/dbconfig/20220517-143441-ladsgroup.json
[14:34:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:54] <wikibugs>	 (PS2) Cathal Mooney: VRF element additions for cloudsw extention to row E/F [homer/public] - https://gerrit.wikimedia.org/r/792624 (https://phabricator.wikimedia.org/T304989)
[14:34:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance
[14:34:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1164.eqiad.wmnet with reason: Maintenance
[14:34:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:04] <wikibugs>	 (CR) Andrew Bogott: [C: -1] "I'm ignoring this pending a more coherent plan about how to host Horizon generally. LMK if I've misunderstood and this is relevant to some" [puppet] - https://gerrit.wikimedia.org/r/781950 (https://phabricator.wikimedia.org/T305453) (owner: Majavah)
[14:35:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] VRF element additions for cloudsw extention to row E/F [homer/public] - https://gerrit.wikimedia.org/r/792624 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[14:35:42] <wikibugs>	 (CR) Andrew Bogott: [C: +2] delete expired ldap-labs certificates [puppet] - https://gerrit.wikimedia.org/r/791674 (owner: Dzahn)
[14:37:50] <wikibugs>	 SRE, SRE-Access-Requests, Machine-Learning-Team (Active Tasks): Requesting access to the deployment POSIX group for aikochou and kevinbazira - https://phabricator.wikimedia.org/T308308 (elukey)
[14:39:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T300774)', diff saved to https://phabricator.wikimedia.org/P27865 and previous config saved to /var/cache/conftool/dbconfig/20220517-143916-ladsgroup.json
[14:39:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:22] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[14:40:33] <wikibugs>	 (CR) Andrew Bogott: [C: +1] openstack: Make enc api enforce keystone policy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/792619 (https://phabricator.wikimedia.org/T274666) (owner: Majavah)
[14:41:11] <wikibugs>	 (PS3) Cathal Mooney: VRF element additions for cloudsw extention to row E/F [homer/public] - https://gerrit.wikimedia.org/r/792624 (https://phabricator.wikimedia.org/T304989)
[14:41:20] <wikibugs>	 (PS1) PipelineBot: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/792625
[14:42:09] <wikibugs>	 SRE, Traffic, Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (ssingh)
[14:44:15] <wikibugs>	 (CR) Jbond: prometheus::blackbox::check: add new blackbox exporter check (5 comments) [puppet] - https://gerrit.wikimedia.org/r/787067 (owner: Jbond)
[14:44:31] <wikibugs>	 (CR) Muehlenhoff: "Both have signed a volunteer NDA after leaving, but if they are completely inactive at this point, we should also drop the rest of their a" [puppet] - https://gerrit.wikimedia.org/r/792612 (owner: Andrew Bogott)
[14:45:05] <wikibugs>	 (PS12) Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - https://gerrit.wikimedia.org/r/787067
[14:45:16] <wikibugs>	 (PS4) Cathal Mooney: VRF element additions for cloudsw extention to row E/F [homer/public] - https://gerrit.wikimedia.org/r/792624 (https://phabricator.wikimedia.org/T304989)
[14:47:21] <wikibugs>	 (CR) jerkins-bot: [V: -1] prometheus::blackbox::check: add new blackbox exporter check [puppet] - https://gerrit.wikimedia.org/r/787067 (owner: Jbond)
[14:49:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T303603)', diff saved to https://phabricator.wikimedia.org/P27867 and previous config saved to /var/cache/conftool/dbconfig/20220517-144946-ladsgroup.json
[14:49:48] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[14:49:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[14:49:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:49:51] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[14:49:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:49:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T303603)', diff saved to https://phabricator.wikimedia.org/P27868 and previous config saved to /var/cache/conftool/dbconfig/20220517-144959-ladsgroup.json
[14:50:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:25] <wikibugs>	 (CR) Physikerwelt: "could you paste a link to the fixed histograms if deployed?" [deployment-charts] - https://gerrit.wikimedia.org/r/792625 (owner: PipelineBot)
[14:53:03] <wikibugs>	 (CR) Andrew Bogott: [C: +2] icinga: remove creds for a couple of departed WMCS SREs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/792612 (owner: Andrew Bogott)
[14:53:23] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, thanks! Merging." [puppet] - https://gerrit.wikimedia.org/r/792609 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[14:53:27] <wikibugs>	 (CR) Muehlenhoff: [C: +2] visualdiff: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/792609 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[14:54:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T303603)', diff saved to https://phabricator.wikimedia.org/P27869 and previous config saved to /var/cache/conftool/dbconfig/20220517-145406-ladsgroup.json
[14:54:10] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:54:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:56:34] <wikibugs>	 (PS13) Jbond: prometheus::blackbox::check: add new blackbox exporter check [puppet] - https://gerrit.wikimedia.org/r/787067
[14:56:36] <wikibugs>	 (CR) Muehlenhoff: [C: -1] "There's currently one remaining email address which isn't @wikimedia.org or found in the task description of https://phabricator.wikimedia" [puppet] - https://gerrit.wikimedia.org/r/792608 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[14:57:46] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[14:59:09] <wikibugs>	 (CR) Muehlenhoff: icinga: remove creds for a couple of departed WMCS SREs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/792612 (owner: Andrew Bogott)
[14:59:57] <wikibugs>	 (CR) JMeybohm: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/789876 (https://phabricator.wikimedia.org/T304891) (owner: Hnowlan)
[15:00:59] <wikibugs>	 (CR) Muehlenhoff: "Looks good" [puppet] - https://gerrit.wikimedia.org/r/792607 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[15:01:57] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:03:22] <wikibugs>	 (PS2) JMeybohm: Remove null creationTimestamp from CRDs [deployment-charts] - https://gerrit.wikimedia.org/r/792267 (https://phabricator.wikimedia.org/T306165)
[15:03:24] <wikibugs>	 (PS6) JMeybohm: Replace kubeyaml with kubeconform (if available) [deployment-charts] - https://gerrit.wikimedia.org/r/791794 (https://phabricator.wikimedia.org/T306165)
[15:03:26] <wikibugs>	 (PS4) JMeybohm: Add a rake task to generate JSON schema for chart CRDs on the fly [deployment-charts] - https://gerrit.wikimedia.org/r/792577 (https://phabricator.wikimedia.org/T306165)
[15:04:56] <wikibugs>	 (PS4) Jbond: redfish: update signature of requests method with kwargs [software/spicerack] - https://gerrit.wikimedia.org/r/792595
[15:05:01] <wikibugs>	 (CR) Jbond: [C: +2] redfish: update signature of requests method with kwargs [software/spicerack] - https://gerrit.wikimedia.org/r/792595 (owner: Jbond)
[15:07:33] <wikibugs>	 (CR) Jbond: [C: -1] cli: Add support for XDG Base Directory spec (1 comment) [software/puppet-compiler] - https://gerrit.wikimedia.org/r/791459 (owner: BCornwall)
[15:09:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27870 and previous config saved to /var/cache/conftool/dbconfig/20220517-150911-ladsgroup.json
[15:09:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:10:17] <wikibugs>	 (03PS1) 10Jbond: Revert "delete expired ldap-labs certificates" [puppet] - 10https://gerrit.wikimedia.org/r/792482
[15:11:24] <wikibugs>	 (03Abandoned) 10Jbond: Revert "delete expired ldap-labs certificates" [puppet] - 10https://gerrit.wikimedia.org/r/792482 (owner: 10Jbond)
[15:12:48] <wikibugs>	 (03Merged) 10jenkins-bot: redfish: update signature of requests method with kwargs [software/spicerack] - 10https://gerrit.wikimedia.org/r/792595 (owner: 10Jbond)
[15:13:18] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:13:31] <wikibugs>	 (03PS1) 10BCornwall: "admin: Re-add user "brett" to ops group"" [puppet] - 10https://gerrit.wikimedia.org/r/792483
[15:13:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 04-1] "Not ready yet" [puppet] - 10https://gerrit.wikimedia.org/r/792568 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[15:14:20] <wikibugs>	 (03PS2) 10BCornwall: "admin: Re-add user "brett" to ops group"" [puppet] - 10https://gerrit.wikimedia.org/r/792483
[15:15:20] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:15:21] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] "admin: Re-add user "brett" to ops group"" [puppet] - 10https://gerrit.wikimedia.org/r/792483 (owner: 10BCornwall)
[15:16:54] <wikibugs>	 (03PS1) 10Ssingh: durum: return the site/DC in the check response [puppet] - 10https://gerrit.wikimedia.org/r/792635
[15:17:28] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.324 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:17:44] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48108 bytes in 0.219 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:17:53] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35309/console" [puppet] - 10https://gerrit.wikimedia.org/r/792635 (owner: 10Ssingh)
[15:20:38] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: return the site/DC in the check response [puppet] - 10https://gerrit.wikimedia.org/r/792635 (owner: 10Ssingh)
[15:22:46] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[15:24:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P27871 and previous config saved to /var/cache/conftool/dbconfig/20220517-152416-ladsgroup.json
[15:24:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:30:41] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] icinga: remove creds for a couple of departed WMCS SREs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792612 (owner: 10Andrew Bogott)
[15:32:59] <wikibugs>	 (03PS1) 10Ssingh: durum: set site to null when Wikidough is not enabled [puppet] - 10https://gerrit.wikimedia.org/r/792638
[15:33:51] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35310/console" [puppet] - 10https://gerrit.wikimedia.org/r/792638 (owner: 10Ssingh)
[15:34:33] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+1] Deploy TOC A/B test to pilot wikis except frwiki, ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792272 (https://phabricator.wikimedia.org/T306607) (owner: 10Clare Ming)
[15:34:36] <wikibugs>	 (03PS4) 10Jdlrobson: Deploy TOC A/B test to pilot wikis except frwiki, ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/792272 (https://phabricator.wikimedia.org/T306607) (owner: 10Clare Ming)
[15:36:49] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: set site to null when Wikidough is not enabled [puppet] - 10https://gerrit.wikimedia.org/r/792638 (owner: 10Ssingh)
[15:38:02] <icinga-wm>	 PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:39:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T303603)', diff saved to https://phabricator.wikimedia.org/P27872 and previous config saved to /var/cache/conftool/dbconfig/20220517-153921-ladsgroup.json
[15:39:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance
[15:39:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance
[15:39:27] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[15:39:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance
[15:39:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:32] <wikibugs>	 (03PS3) 10Ssingh: "admin: Re-add user "brett" to ops group"" [puppet] - 10https://gerrit.wikimedia.org/r/792483 (owner: 10BCornwall)
[15:39:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance
[15:39:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:39:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:51] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[15:40:52] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[15:40:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[15:43:05] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[15:43:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T303603)', diff saved to https://phabricator.wikimedia.org/P27873 and previous config saved to /var/cache/conftool/dbconfig/20220517-154310-ladsgroup.json
[15:43:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:46] <wikibugs>	 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic: Betacommons: 504, Connection Timed Out at 2022-05-02 13:35:16 GMT - https://phabricator.wikimedia.org/T307354 (10AlexisJazz) Right now it works, as usual with these it was a transient error.
[15:45:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T303603)', diff saved to https://phabricator.wikimedia.org/P27874 and previous config saved to /var/cache/conftool/dbconfig/20220517-154502-ladsgroup.json
[15:45:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:08] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[15:45:54] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10KartikMistry)
[15:52:01] <icinga-wm>	 RECOVERY - Disk space on ms-be1040 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1040&var-datasource=eqiad+prometheus/ops
[15:57:35] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:00:05] <jouncebot>	 jbond and rzl: Your horoscope predicts another unfortunate Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220517T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:22:42] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Move Hadoop eventlogs cleanup to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/792116 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[16:27:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[16:27:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[16:27:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T298555)', diff saved to https://phabricator.wikimedia.org/P27875 and previous config saved to /var/cache/conftool/dbconfig/20220517-162738-ladsgroup.json
[16:27:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:45] <stashbot>	 T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555
[16:28:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Manual repool', diff saved to https://phabricator.wikimedia.org/P27876 and previous config saved to /var/cache/conftool/dbconfig/20220517-162835-ladsgroup.json
[16:28:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[16:30:19] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[16:30:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T303603)', diff saved to https://phabricator.wikimedia.org/P27877 and previous config saved to /var/cache/conftool/dbconfig/20220517-163024-ladsgroup.json
[16:30:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:30] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[16:34:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T303603)', diff saved to https://phabricator.wikimedia.org/P27878 and previous config saved to /var/cache/conftool/dbconfig/20220517-163446-ladsgroup.json
[16:34:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:00] <wikibugs>	 (03PS1) 10Jbond: hiera_export: add unmanaged (mostly) network devices [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/792644
[16:48:55] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "looks good to me. thanks! (random comment: I always read this as "lion update". Can't help it:)" [puppet] - 10https://gerrit.wikimedia.org/r/792121 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede)
[16:49:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P27880 and previous config saved to /var/cache/conftool/dbconfig/20220517-164951-ladsgroup.json
[16:49:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P27881 and previous config saved to /var/cache/conftool/dbconfig/20220517-170456-ladsgroup.json
[17:05:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:07:26] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10MoritzMuehlenhoff)
[17:07:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10MoritzMuehlenhoff) ganeti4003 is from the same batch and needs the same updates. I've migrated instances, removed it from the cluster for the reimage and downtimed it.
[17:08:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti4003.ulsfo.wmnet with reason: Remove from cluster for eventual reimage
[17:08:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti4003.ulsfo.wmnet with reason: Remove from cluster for eventual reimage
[17:08:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:11:32] <wikibugs>	 (03PS1) 10David Caro: wmcs-image-create: Remove puppet cron on the template image [puppet] - 10https://gerrit.wikimedia.org/r/792669
[17:12:09] <wikibugs>	 (03PS1) 10Muehlenhoff: Enable ganeti4004 as Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/792670
[17:13:01] <wikibugs>	 (03PS2) 10David Caro: wmcs-image-create: Remove puppet cron on the template image [puppet] - 10https://gerrit.wikimedia.org/r/792669
[17:16:27] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (10bcampbell) Hey @Dzahn my apologies for the delay. I just completed the first two steps:    - ITS conf...
[17:20:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T303603)', diff saved to https://phabricator.wikimedia.org/P27882 and previous config saved to /var/cache/conftool/dbconfig/20220517-172001-ladsgroup.json
[17:20:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:07] <stashbot>	 T303603: Add actor and comment columns to cu_changes - https://phabricator.wikimedia.org/T303603
[17:24:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: 1 VM request for turnilo/superset staging on Bullseye - https://phabricator.wikimedia.org/T306213 (10razzi) I'm going to go ahead and put this on row A. Here's a little snippet I used to look at the ganeti resource totals by row (`python -m pip instal...
[17:25:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[17:25:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance
[17:25:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T300774)', diff saved to https://phabricator.wikimedia.org/P27883 and previous config saved to /var/cache/conftool/dbconfig/20220517-172521-ladsgroup.json
[17:25:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:29] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[17:26:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T300774)', diff saved to https://phabricator.wikimedia.org/P27884 and previous config saved to /var/cache/conftool/dbconfig/20220517-172632-ladsgroup.json
[17:26:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:37] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[17:28:41] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.029 second response time https://wikitech.wikimedia.org/wiki/Swift
[17:30:24] <wikibugs>	 (03PS1) 10Ssingh: durum: display the DC the user is connected to in the frontend [puppet] - 10https://gerrit.wikimedia.org/r/792676
[17:31:09] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35311/console" [puppet] - 10https://gerrit.wikimedia.org/r/792676 (owner: 10Ssingh)
[17:34:09] <wikibugs>	 (03PS2) 10Jforrester: TimedMediaHandler: Disabled the BetaFeature from wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/788385 (https://phabricator.wikimedia.org/T248418)
[17:38:43] <icinga-wm>	 RECOVERY - Host an-tool1005 is UP: PING OK - Packet loss = 0%, RTA = 1.66 ms
[17:43:27] <icinga-wm>	 PROBLEM - SSH on an-tool1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:47:13] <icinga-wm>	 RECOVERY - SSH on an-tool1005 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:53:07] <icinga-wm>	 PROBLEM - Host an-tool1005 is DOWN: PING CRITICAL - Packet loss = 100%
[17:54:00] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:55:47] <icinga-wm>	 RECOVERY - Host an-tool1005 is UP: PING OK - Packet loss = 0%, RTA = 0.59 ms
[17:58:09] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.ganeti.makevm for new host an-tool1011.eqiad.wmnet
[17:58:10] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.dns.netbox
[17:58:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:58:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:00:11] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:02:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q4:(Need By: TBD) rack/setup/install wqds101[4,5,6] - https://phabricator.wikimedia.org/T307138 (10Jclark-ctr)
[18:04:28] <wikibugs>	 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10Jclark-ctr)
[18:04:56] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdev1003 - https://phabricator.wikimedia.org/T306935 (10Jclark-ctr)
[18:06:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install cloudvirt105[123].eqiad.wmnet - https://phabricator.wikimedia.org/T305194 (10Jclark-ctr)
[18:08:41] <wikibugs>	 (03PS1) 10Razzi: dhcpd: make an-tool1005 use debian 10 [puppet] - 10https://gerrit.wikimedia.org/r/792686 (https://phabricator.wikimedia.org/T308597)
[18:08:45] <wikibugs>	 10SRE, 10DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (10Jclark-ctr)
[18:09:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering: Q4:(Need By: TBD) rack/setup/install an-presto10[06-15].eqiad.wmnet - https://phabricator.wikimedia.org/T306835 (10Jclark-ctr)
[18:13:04] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Jclark-ctr)
[18:15:25] <wikibugs>	 (03CR) 10Razzi: [C: 03+2] dhcpd: make an-tool1005 use debian 10 [puppet] - 10https://gerrit.wikimedia.org/r/792686 (https://phabricator.wikimedia.org/T308597) (owner: 10Razzi)
[18:16:58] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:17:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:09] <icinga-wm>	 PROBLEM - Host an-tool1005 is DOWN: PING CRITICAL - Packet loss = 100%
[18:22:07] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:26:15] <icinga-wm>	 RECOVERY - Host an-tool1005 is UP: PING OK - Packet loss = 0%, RTA = 1.53 ms
[18:26:37] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host an-tool1011.eqiad.wmnet
[18:26:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:27] <icinga-wm>	 PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.20.24: Connection reset by peer https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:40:38] <icinga-wm>	 RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:43:16] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s2 #page on db1156 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 21585.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[18:43:35] <Amir1>	 here
[18:43:38] <Amir1>	 that's me
[18:43:40] <jynus>	 that is a big gap, is that a depooled host under maintenance?
[18:43:42] <marostegui>	 uh?
[18:43:44] <Amir1>	 the depool time wasn't enough
[18:43:53] <Amir1>	 the downtime
[18:43:55] <jynus>	 downtime?
[18:43:58] <jynus>	 ah, cool
[18:43:58] <marostegui>	 ah ok
[18:43:59] <Amir1>	 yeah
[18:44:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10RobH) updates all done, system is back up for reimage whenever
[18:44:05] <jynus>	 so it is depooled, right?
[18:44:12] <slyngs>	 Don't scare people like that :-)
[18:44:26] <marostegui>	 yes, it is depooled
[18:44:27] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Upgrade ganeti/ulsfo to Bullseye - https://phabricator.wikimedia.org/T307997 (10RobH) a:05RobH→03MoritzMuehlenhoff
[18:44:38] <Amir1>	 my bad
[18:44:46] <Amir1>	 resolved
[18:45:02] <Amir1>	 let me downtime it for two more hours
[18:45:20] <jynus>	 for people on call, https://grafana.wikimedia.org/d/000000278/mysql-aggregated and https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard are good dashboards to double check no impact 
[18:45:20] <marostegui>	 Amir1: give it 4 just in case XD
[18:45:47] <wikibugs>	 (03CR) 10Zabe: wikilabels: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792608 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[18:46:00] <bblack>	 hey
[18:46:32] <bblack>	 oh ok, resolved already :)
[18:46:41] <jhathaway>	 bblack: yup!
[18:46:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1156.eqiad.wmnet with reason: Maint
[18:46:45] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1156.eqiad.wmnet with reason: Maint
[18:46:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:46:58] <Amir1>	 downtimed for five more hours ^
[18:47:10] <volans>	 if nobody has acked yet on VO I suggest doing that though
[18:47:21] <Amir1>	 I did
[18:48:08] <volans>	 thx
[18:48:39] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:51:40] <wikibugs>	 (03PS1) 10Zabe: varnishkafka: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792692 (https://phabricator.wikimedia.org/T308013)
[18:54:20] <Amir1>	 sorry, alter table on revision table takes long but more than six hours on s2? That was a bit unexpected 
[18:55:36] <jynus>	 Amir1: assuming you are working now (ignore me if not) there is some weird pattern for uncached traffic since a few hours ago
[18:55:39] <wikibugs>	 (03PS1) 10Zabe: vagrant: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792693 (https://phabricator.wikimedia.org/T308013)
[18:55:53] <jynus>	 some of it must be just the alter (extra db writes)
[18:55:53] <Amir1>	 jynus: where is it?
[18:56:04] <jynus>	 but some may not be explained by it
[18:56:10] <Amir1>	 I'm cleaning up x1  as well
[18:56:13] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] vagrant: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/792693 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[18:56:25] <jynus>	 Amir1: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard
[18:57:09] <jynus>	 let me set the time so it is clearer
[18:57:23] <jynus>	 https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1652209031443&to=1652813831443
[18:57:35] <Amir1>	 which panel?
[18:57:56] <jynus>	 a few- first regular 200 get requests
[18:58:10] <jynus>	 which wouldn't be too worrying as that would be just traffic-related
[18:58:26] <jynus>	 and the db ones would be explained by schema changes
[18:58:31] <jynus>	 but see mcrouter
[18:58:37] <Amir1>	 I don't know the dip but the pattern looks normal-ish
[18:58:52] <jynus>	 that means a performance issue- more parsings than usual
[18:59:05] <wikibugs>	 (03PS1) 10Zabe: vagrant: add shebang to alias-vagrant-profile-d.sh [puppet] - 10https://gerrit.wikimedia.org/r/792694
[18:59:06] <jynus>	 https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1652209031443&to=1652813831443
[18:59:22] <jynus>	 seems back to normal now
[18:59:25] <Amir1>	 I think there is someone parsing stuff with 100 req/s
[18:59:34] <jynus>	 yeah, that would explain it
[18:59:49] <jynus>	 as long as it is external-triggered no issue
[19:00:04] <wikibugs>	 (03CR) 10Zabe: vagrant: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/792693 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe)
[19:00:52] <jynus>	 will keep an eye on that tomorrow
[19:00:55] <jynus>	 leaving for now
[19:00:59] <Amir1>	 have fun!
[19:01:12] <jynus>	 as the effect is only a slight perf increase, nothing too crazy
[19:01:36] <jynus>	 well, perf decrease, latency increase :-)
[19:01:57] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[19:01:58] <jynus>	 have a nice day!
[19:07:32] <wikibugs>	 (03PS1) 10BryanDavis: toolhub: Bump container version to 2022-05-17-072641-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/792696 (https://phabricator.wikimedia.org/T303909)
[19:07:33] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:08:34] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdev1003 - https://phabricator.wikimedia.org/T306935 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson frdb1005            c1.  u3.    port;   4 , 4          cableid#    2945 , 4042 frdev1003           c1  u4...
[19:10:04] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frdb1005, frdev1003 - https://phabricator.wikimedia.org/T306935 (10Jclark-ctr)
[19:15:02] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1 C: 03+2] durum: display the DC the user is connected to in the frontend [puppet] - 10https://gerrit.wikimedia.org/r/792676 (owner: 10Ssingh)
[19:18:58] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container version to 2022-05-17-072641-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/792696 (https://phabricator.wikimedia.org/T303909) (owner: 10BryanDavis)
[19:21:47] <wikibugs>	 (03PS1) 10Majavah: nrpe: add nrpe::script to only installs scripts to hosts with nrpe [puppet] - 10https://gerrit.wikimedia.org/r/792700
[19:22:23] <wikibugs>	 (03PS1) 10Andrew Bogott: profile::wmcs::instance: create nrpe plugin directory [puppet] - 10https://gerrit.wikimedia.org/r/792701
[19:22:41] <wikibugs>	 (03PS2) 10Majavah: nrpe: add nrpe::plugin to only installs scripts to hosts with nrpe [puppet] - 10https://gerrit.wikimedia.org/r/792700
[19:22:46] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[19:24:06] <wikibugs>	 (03Merged) 10jenkins-bot: toolhub: Bump container version to 2022-05-17-072641-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/792696 (https://phabricator.wikimedia.org/T303909) (owner: 10BryanDavis)
[19:25:08] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35312/console" [puppet] - 10https://gerrit.wikimedia.org/r/792700 (owner: 10Majavah)
[19:25:58] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply
[19:26:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:58] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply
[19:27:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:27:04] <wikibugs>	 (CR) jerkins-bot: [V: -1] nrpe: add nrpe::plugin to only installs scripts to hosts with nrpe [puppet] - https://gerrit.wikimedia.org/r/792700 (owner: Majavah)
[19:28:08] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply
[19:28:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:32] <wikibugs>	 (PS4) Ssingh: admin: Re-add user "brett" to ops group [puppet] - https://gerrit.wikimedia.org/r/792483 (owner: BCornwall)
[19:28:57] <wikibugs>	 (PS3) Majavah: nrpe: add nrpe::plugin to only installs scripts to hosts with nrpe [puppet] - https://gerrit.wikimedia.org/r/792700
[19:29:08] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frlog1002 - https://phabricator.wikimedia.org/T306839 (Jclark-ctr) frlog1002 C1 U37 port; 2 , 2 cableid# 23000047 , 23000061
[19:29:55] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:29:55] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply
[19:29:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:42] <wikibugs>	 (CR) BCornwall: [C: +2] admin: Re-add user "brett" to ops group [puppet] - https://gerrit.wikimedia.org/r/792483 (owner: BCornwall)
[19:31:52] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35313/console" [puppet] - https://gerrit.wikimedia.org/r/792700 (owner: Majavah)
[19:32:15] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:32:52] <wikibugs>	 (CR) jerkins-bot: [V: -1] nrpe: add nrpe::plugin to only installs scripts to hosts with nrpe [puppet] - https://gerrit.wikimedia.org/r/792700 (owner: Majavah)
[19:33:25] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply
[19:33:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:34:55] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply
[19:34:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:35:24] <wikibugs>	 (CR) Herron: "Nice! looks good, thanks for putting it together" [alerts] - https://gerrit.wikimedia.org/r/792564 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[19:35:39] <wikibugs>	 (PS2) Andrew Bogott: profile::wmcs::instance: create nrpe plugin directory [puppet] - https://gerrit.wikimedia.org/r/792701 (https://phabricator.wikimedia.org/T308601)
[19:35:54] <wikibugs>	 (PS4) Andrew Bogott: nrpe: add nrpe::plugin to only installs scripts to hosts with nrpe [puppet] - https://gerrit.wikimedia.org/r/792700 (https://phabricator.wikimedia.org/T308601) (owner: Majavah)
[19:38:27] <wikibugs>	 (CR) Majavah: [C: -1] "this would need /usr/lib/nagios/plugins/ as well to be fully effective" [puppet] - https://gerrit.wikimedia.org/r/792701 (https://phabricator.wikimedia.org/T308601) (owner: Andrew Bogott)
[19:38:38] <wikibugs>	 (PS5) Majavah: nrpe: add nrpe::plugin to only installs scripts to hosts with nrpe [puppet] - https://gerrit.wikimedia.org/r/792700 (https://phabricator.wikimedia.org/T308601)
[19:38:40] <wikibugs>	 (PS1) Majavah: base::firewall: migrate to nrpe::plugin [puppet] - https://gerrit.wikimedia.org/r/792705 (https://phabricator.wikimedia.org/T308601)
[19:40:11] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35315/console" [puppet] - https://gerrit.wikimedia.org/r/792705 (https://phabricator.wikimedia.org/T308601) (owner: Majavah)
[19:40:24] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35314/console" [puppet] - https://gerrit.wikimedia.org/r/792700 (https://phabricator.wikimedia.org/T308601) (owner: Majavah)
[19:41:15] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frlog1002 - https://phabricator.wikimedia.org/T306839 (Jclark-ctr)
[19:41:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frlog1002 - https://phabricator.wikimedia.org/T306839 (Jclark-ctr) a:Jclark-ctr→Cmjohnson
[19:41:52] <wikibugs>	 SRE, DC-Ops: Confirm support of PERC 750 raid controller - https://phabricator.wikimedia.org/T297913 (Jclark-ctr)
[19:44:07] <bd808>	 !log Updated Toolhub to 42072d, applied db migrations, and rebuilt search indexes
[19:44:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:58:17] <wikibugs>	 (PS1) Ssingh: test_dns: update DNS/durum test to reflect changes in API [software/knead-wikidough] - https://gerrit.wikimedia.org/r/792706
[19:59:09] <wikibugs>	 (CR) Ssingh: [C: +2] test_dns: update DNS/durum test to reflect changes in API [software/knead-wikidough] - https://gerrit.wikimedia.org/r/792706 (owner: Ssingh)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, and cjming: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220517T2000).
[20:00:05] <jouncebot>	 cjming: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:35] <cjming>	 i'm the only one so i'll deploy my own patch
[20:01:03] <cjming>	 and wait around for a few before closing window
[20:01:19] <wikibugs>	 (PS1) Ladsgroup: mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/792707 (https://phabricator.wikimedia.org/T301312)
[20:01:34] <wikibugs>	 (CR) Clare Ming: [C: +2] Deploy TOC A/B test to pilot wikis except frwiki, ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/792272 (https://phabricator.wikimedia.org/T306607) (owner: Clare Ming)
[20:02:34] <wikibugs>	 (Merged) jenkins-bot: Deploy TOC A/B test to pilot wikis except frwiki, ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/792272 (https://phabricator.wikimedia.org/T306607) (owner: Clare Ming)
[20:03:05] <wikibugs>	 (PS1) Ladsgroup: db1118: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/792708 (https://phabricator.wikimedia.org/T301312)
[20:04:19] <wikibugs>	 (PS1) Ladsgroup: wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/792709 (https://phabricator.wikimedia.org/T301312)
[20:05:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:05:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:05:15] <wikibugs>	 (CR) Ladsgroup: [C: -2] wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/792709 (https://phabricator.wikimedia.org/T301312) (owner: Ladsgroup)
[20:05:24] <wikibugs>	 (CR) Ladsgroup: [C: -2] db1118: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/792708 (https://phabricator.wikimedia.org/T301312) (owner: Ladsgroup)
[20:05:31] <wikibugs>	 (CR) Ladsgroup: [C: -2] mariadb: Promote db1163 to s1 master [puppet] - https://gerrit.wikimedia.org/r/792707 (https://phabricator.wikimedia.org/T301312) (owner: Ladsgroup)
[20:06:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:06:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:06:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:56] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:06:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:07:26] <wikibugs>	 (PS1) Stang: betawikiversity: HIDPI support for logo [mediawiki-config] - https://gerrit.wikimedia.org/r/792710 (https://phabricator.wikimedia.org/T308604)
[20:08:46] <koi>	 Hi cjming, is it still ok to deploy?
[20:09:17] <cjming>	 hi koi: sure - i'm just finishing up my patch
[20:09:27] <cjming>	 will do yours here shortly
[20:09:35] <koi>	 ack and thanks
[20:10:15] <wikibugs>	 (PS3) Andrew Bogott: profile::wmcs::instance: create nrpe plugin directory [puppet] - https://gerrit.wikimedia.org/r/792701 (https://phabricator.wikimedia.org/T308601)
[20:10:57] <wikibugs>	 (PS1) Ssingh: durum: update check.js site names [puppet] - https://gerrit.wikimedia.org/r/792711
[20:11:46] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:792272|Deploy TOC A/B test to pilot wikis except frwiki, ptwiki (T306607)]] (duration: 00m 53s)
[20:11:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:11:51] <stashbot>	 T306607: Deploy ToC A/B test to remainder of desktop improvements pilot wikis - https://phabricator.wikimedia.org/T306607
[20:12:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:12:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:12:25] <wikibugs>	 (PS2) Clare Ming: betawikiversity: HIDPI support for logo [mediawiki-config] - https://gerrit.wikimedia.org/r/792710 (https://phabricator.wikimedia.org/T308604) (owner: Stang)
[20:13:29] <wikibugs>	 (CR) Clare Ming: [C: +2] betawikiversity: HIDPI support for logo [mediawiki-config] - https://gerrit.wikimedia.org/r/792710 (https://phabricator.wikimedia.org/T308604) (owner: Stang)
[20:13:39] <wikibugs>	 (CR) Ssingh: [C: +2] durum: update check.js site names [puppet] - https://gerrit.wikimedia.org/r/792711 (owner: Ssingh)
[20:14:19] <wikibugs>	 (Merged) jenkins-bot: betawikiversity: HIDPI support for logo [mediawiki-config] - https://gerrit.wikimedia.org/r/792710 (https://phabricator.wikimedia.org/T308604) (owner: Stang)
[20:15:49] <cjming>	 koi: can you check changes on mwdebug1001?
[20:16:12] <koi>	 looking
[20:17:09] <koi>	 LGTM
[20:17:18] <cjming>	 great - syncing now
[20:18:24] <wikibugs>	 (PS1) Razzi: site: add an-tool1011 to insetup [puppet] - https://gerrit.wikimedia.org/r/792712 (https://phabricator.wikimedia.org/T308597)
[20:18:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:18:48] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:18:50] <logmsgbot>	 !log cjming@deploy1002 Synchronized static/images/project-logos/betawikiversity.png: Config: [[gerrit:792710|betawikiversity: HIDPI support for logo (T308604)]] (duration: 00m 54s)
[20:18:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:18:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:19:00] <stashbot>	 T308604: Optimize Logo of Beta Wikiversity - https://phabricator.wikimedia.org/T308604
[20:19:50] <logmsgbot>	 !log cjming@deploy1002 Synchronized static/images/project-logos/betawikiversity-1.5x.png: Config: [[gerrit:792710|betawikiversity: HIDPI support for logo (T308604)]] (duration: 00m 56s)
[20:19:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:46] <logmsgbot>	 !log cjming@deploy1002 Synchronized static/images/project-logos/betawikiversity-2x.png: Config: [[gerrit:792710|betawikiversity: HIDPI support for logo (T308604)]] (duration: 00m 53s)
[20:20:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:41] <logmsgbot>	 !log cjming@deploy1002 Synchronized logos/config.yaml: Config: [[gerrit:792710|betawikiversity: HIDPI support for logo (T308604)]] (duration: 00m 52s)
[20:21:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:41] <logmsgbot>	 !log cjming@deploy1002 Synchronized wmf-config/logos.php: Config: [[gerrit:792710|betawikiversity: HIDPI support for logo (T308604)]] (duration: 00m 53s)
[20:22:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:52] <cjming>	 koi: your changes should be live
[20:22:59] <koi>	 thanks!
[20:23:03] <cjming>	 np!
[20:25:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:25:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:25:33] <cjming>	 !log end of UTC late backport & config window
[20:25:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:19] <wikibugs>	 (CR) Razzi: [C: +2] site: add an-tool1011 to insetup [puppet] - https://gerrit.wikimedia.org/r/792712 (https://phabricator.wikimedia.org/T308597) (owner: Razzi)
[20:30:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:30:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:31:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:31:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:00] <wikibugs>	 (CR) Dzahn: [V: +2 C: +2] "spot checked a couple of the certs. looks good. usually people don't even create "real fake" certs and just put "placeholder" or "snake oi" [labs/private] - https://gerrit.wikimedia.org/r/791667 (https://phabricator.wikimedia.org/T307798) (owner: Eevans)
[20:34:46] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:34:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:35:17] <wikibugs>	 (PS1) Razzi: install_server: add an-tool1011 as virtual [puppet] - https://gerrit.wikimedia.org/r/792718 (https://phabricator.wikimedia.org/T308597)
[20:36:49] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (Dmantena) @RLazarus Sorry for re-opening this task, but while it appears I have Superset access, it doesn't appear I have SQL/Presto access to be able to view the analytics data I was after. Here'...
[20:37:50] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (RhinosF1) Resolved→Open
[20:39:31] <wikibugs>	 (PS1) Bking: elastic: add reimage to rolling-operation [cookbooks] - https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606)
[20:39:51] <wikibugs>	 (CR) Razzi: [C: +2] install_server: add an-tool1011 as virtual [puppet] - https://gerrit.wikimedia.org/r/792718 (https://phabricator.wikimedia.org/T308597) (owner: Razzi)
[20:40:24] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (Krinkle)
[20:41:07] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (RhinosF1) I've left a message with Analytics to check but based on https://wikitech.wikimedia.org/wiki/Analytics/Data_access#What_access_should_I_request?, I think this may need shell access / a p...
[20:42:04] <wikibugs>	 (CR) jerkins-bot: [V: -1] elastic: add reimage to rolling-operation [cookbooks] - https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: Bking)
[20:43:43] <icinga-wm>	 RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:45:20] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (Milimetric) Indeed, RhinosF1 is right, take a look at that link and I believe you need analytics-privatedata-users to run queries and access Presto-backed dashboards
[20:50:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T298560)', diff saved to https://phabricator.wikimedia.org/P27888 and previous config saved to /var/cache/conftool/dbconfig/20220517-205030-ladsgroup.json
[20:50:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:50:36] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to `wmf` for `Dmantena` - https://phabricator.wikimedia.org/T308294 (RhinosF1) Open→Resolved @DMantena: Can you file a new task using https://phabricator.wikimedia.org/maniphest/task/edit/form/8/ or copy the information from that form into this task?  A bit...
[20:50:37] <stashbot>	 T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[20:52:06] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s2 #page on db1156 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[20:53:07] <wikibugs>	 (CR) Andrew Bogott: [C: +2] "This is a perfectly reasonable refactor -- I'm going to merge it right now so that I can add another line on top." [puppet] - https://gerrit.wikimedia.org/r/792669 (owner: David Caro)
[20:57:21] <wikibugs>	 (PS1) Ebernhardson: Resolve minimum_should_match warnings during random scoring [extensions/CirrusSearch] (wmf/1.39.0-wmf.12) - https://gerrit.wikimedia.org/r/792649 (https://phabricator.wikimedia.org/T288765)
[20:57:38] <wikibugs>	 (PS1) Ebernhardson: haslicense: Apply minimum_should_match for elastic 7.x [extensions/WikibaseCirrusSearch] (wmf/1.39.0-wmf.12) - https://gerrit.wikimedia.org/r/792650 (https://phabricator.wikimedia.org/T288765)
[20:59:33] <icinga-wm>	 PROBLEM - Check size of conntrack table on an-tool1005 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.36.117: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[21:00:29] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:01:13] <icinga-wm>	 PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.36.117: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:01:51] <icinga-wm>	 RECOVERY - Check size of conntrack table on an-tool1005 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[21:02:13] <icinga-wm>	 RECOVERY - puppet last run on an-tool1005 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[21:03:19] <wikibugs>	 (PS1) Andrew Bogott: wmcs-image-create.py: Inject a couple of nagios plugin dirs into our image [puppet] - https://gerrit.wikimedia.org/r/792721 (https://phabricator.wikimedia.org/T308601)
[21:03:31] <icinga-wm>	 RECOVERY - Check systemd state on an-tool1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:04:17] <wikibugs>	 (CR) jerkins-bot: [V: -1] wmcs-image-create.py: Inject a couple of nagios plugin dirs into our image [puppet] - https://gerrit.wikimedia.org/r/792721 (https://phabricator.wikimedia.org/T308601) (owner: Andrew Bogott)
[21:05:13] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Znuny, fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (Dzahn) Stalled→Open
[21:05:19] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (Dzahn)
[21:05:29] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (Dzahn)
[21:05:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P27889 and previous config saved to /var/cache/conftool/dbconfig/20220517-210535-ladsgroup.json
[21:05:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:05:39] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Znuny, fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (Dzahn) Open→In progress
[21:09:55] <wikibugs>	 (PS2) Andrew Bogott: wmcs-image-create.py: Inject a couple of nagios plugin dirs into our image [puppet] - https://gerrit.wikimedia.org/r/792721 (https://phabricator.wikimedia.org/T308601)
[21:10:01] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:14:50] <wikibugs>	 (CR) Volans: "generic comment inline" [cookbooks] - https://gerrit.wikimedia.org/r/792719 (https://phabricator.wikimedia.org/T308606) (owner: Bking)
[21:20:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P27890 and previous config saved to /var/cache/conftool/dbconfig/20220517-212040-ladsgroup.json
[21:20:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P27891 and previous config saved to /var/cache/conftool/dbconfig/20220517-212316-ladsgroup.json
[21:23:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[21:25:26] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[21:25:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T300774)', diff saved to https://phabricator.wikimedia.org/P27892 and previous config saved to /var/cache/conftool/dbconfig/20220517-212530-ladsgroup.json
[21:25:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:36] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[21:27:53] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Znuny, fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (Dzahn) Hi @bcampbell I removed the donate@ alias from the mail servers right now.  I can confirm it n...
[21:28:53] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Znuny, fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (Dzahn) How about "donation@" as opposed to "donate@".  Is that an alias for fundraising for for donat...
[21:33:03] <icinga-wm>	 RECOVERY - Check systemd state on an-tool1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:37:30] <wikibugs>	 (PS1) Razzi: turnilo: move staging instance to an-tool1011 [puppet] - https://gerrit.wikimedia.org/r/792724 (https://phabricator.wikimedia.org/T308597)
[21:37:51] <wikibugs>	 (CR) jerkins-bot: [V: -1] turnilo: move staging instance to an-tool1011 [puppet] - https://gerrit.wikimedia.org/r/792724 (https://phabricator.wikimedia.org/T308597) (owner: Razzi)
[21:38:21] <wikibugs>	 (PS2) Razzi: turnilo: move staging instance to an-tool1011 [puppet] - https://gerrit.wikimedia.org/r/792724 (https://phabricator.wikimedia.org/T308597)
[21:38:57] <wikibugs>	 (CR) jerkins-bot: [V: -1] turnilo: move staging instance to an-tool1011 [puppet] - https://gerrit.wikimedia.org/r/792724 (https://phabricator.wikimedia.org/T308597) (owner: Razzi)
[21:43:28] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to `wmf` for Tsevener - https://phabricator.wikimedia.org/T308616 (Tsevener)
[21:43:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T300774)', diff saved to https://phabricator.wikimedia.org/P27893 and previous config saved to /var/cache/conftool/dbconfig/20220517-214349-ladsgroup.json
[21:43:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:55] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[21:44:15] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 111 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:46:33] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 31 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:46:56] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Znuny, fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (bcampbell) I've sent test mail from a couple different addresses, one internal and one external, and...
[21:48:17] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:52:22] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: Site: 1 VM request for turnilo/superset staging on Bullseye - https://phabricator.wikimedia.org/T306213 (razzi) Open→Resolved VM created. Work continues at https://phabricator.wikimedia.org/T308597
[21:52:24] <mutante>	 !log alert1001 - systemctl start certspotter (after alert that the unit was failed. happens sometimes)
[21:52:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:54:00] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:54:19] <wikibugs>	 SRE-tools, DNS, Infrastructure-Foundations, Traffic: DNS repo: add Jenkins job to ensure there are no duplicates - https://phabricator.wikimedia.org/T155761 (Volans) I've a local patch that I'm testing to perform the validation of the whole dataset (manual + netbox). The preliminary results are b...
[21:56:13] <wikibugs>	 (PS3) Razzi: turnilo: move staging instance to an-tool1011 [puppet] - https://gerrit.wikimedia.org/r/792724 (https://phabricator.wikimedia.org/T308597)
[21:58:34] <wikibugs>	 (CR) Volans: hiera_export: add unmanaged (mostly) network devices (2 comments) [software/netbox-extras] - https://gerrit.wikimedia.org/r/792644 (owner: Jbond)
[21:58:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P27894 and previous config saved to /var/cache/conftool/dbconfig/20220517-215854-ladsgroup.json
[21:58:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:00:56] <wikibugs>	 SRE-tools, Discovery, Discovery-Search, Infrastructure-Foundations, IPv6: Some elastic hosts do not have IPv6 DNS entries - https://phabricator.wikimedia.org/T271143 (bking) a:bking
[22:01:13] <wikibugs>	 SRE-tools, Discovery, Discovery-Search, Infrastructure-Foundations, IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (bking)
[22:03:20] <wikibugs>	 SRE, ops-codfw: Recycling Pickup for CODFW - https://phabricator.wikimedia.org/T307694 (Papaul) Pickup and on site shred  complete .  {F35150027}
[22:04:43] <icinga-wm>	 PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:05:56] <urbanecm>	 jouncebot: nowandnext
[22:05:56] <jouncebot>	 No deployments scheduled for the next 8 hour(s) and 54 minute(s)
[22:05:56] <jouncebot>	 In 8 hour(s) and 54 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220518T0700)
[22:07:17] * urbanecm stashing at debug servers
[22:07:54] * urbanecm finished
[22:08:21] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:08:59] <wikibugs>	 (CR) Brennen Bearnes: "This has been tested against an existing WMCS runner.  Works as expected.  Sample error message in failed pipeline:" [puppet] - https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: Brennen Bearnes)
[22:09:36] * urbanecm goes to deploy a patch now
[22:10:12] <wikibugs>	 (PS3) Andrew Bogott: wmcs-image-create.py: Inject a couple of nagios plugin dirs into our image [puppet] - https://gerrit.wikimedia.org/r/792721 (https://phabricator.wikimedia.org/T308601)
[22:10:39] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:12:54] <wikibugs>	 (PS1) Urbanecm: langlist: add kcg language [mediawiki-config] - https://gerrit.wikimedia.org/r/792731 (https://phabricator.wikimedia.org/T305279)
[22:12:56] <wikibugs>	 (CR) Urbanecm: [C: +2] langlist: add kcg language [mediawiki-config] - https://gerrit.wikimedia.org/r/792731 (https://phabricator.wikimedia.org/T305279) (owner: Urbanecm)
[22:12:58] <wikibugs>	 (PS1) Urbanecm: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/792732
[22:13:01] <wikibugs>	 (CR) Urbanecm: [C: +2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/792732 (owner: Urbanecm)
[22:13:49] <wikibugs>	 (Merged) jenkins-bot: langlist: add kcg language [mediawiki-config] - https://gerrit.wikimedia.org/r/792731 (https://phabricator.wikimedia.org/T305279) (owner: Urbanecm)
[22:13:55] <wikibugs>	 (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/792732 (owner: Urbanecm)
[22:14:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P27895 and previous config saved to /var/cache/conftool/dbconfig/20220517-221359-ladsgroup.json
[22:14:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:34] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized langlist: cd704d4f: langlist: add kcg language (T305279) (duration: 00m 53s)
[22:15:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:39] <stashbot>	 T305279: Create Wikipedia Tyap - https://phabricator.wikimedia.org/T305279
[22:16:27] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/interwiki.php: c2151b3: Update interwiki cache (duration: 00m 52s)
[22:16:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:34] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:16:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:21] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:17:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:17:27] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:17:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:18:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:18:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:19:14] * urbanecm done with deployment
[22:19:27] <wikibugs>	 (CR) Eevans: [C: -1] WIP: enable cassandra encryption (aqs cluster) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/791663 (https://phabricator.wikimedia.org/T307798) (owner: Eevans)
[22:19:57] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:22:45] <wikibugs>	 (PS4) Razzi: turnilo: move staging instance to an-tool1011 [puppet] - https://gerrit.wikimedia.org/r/792724 (https://phabricator.wikimedia.org/T308597)
[22:22:50] <wikibugs>	 SRE-tools, Discovery, Discovery-Search, Infrastructure-Foundations, IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (bking) For clarity, client-side IPv6 connectivity to search functions in wikipedia, wikicommons, etc does not require the Elas...
[22:23:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[22:23:33] <wikibugs>	 (CR) Razzi: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35320/console" [puppet] - https://gerrit.wikimedia.org/r/792724 (https://phabricator.wikimedia.org/T308597) (owner: Razzi)
[22:23:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:46] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Znuny, fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (Dzahn) The following aliases have all been removed on the SRE side now:  donation@ donations@ donate@...
[22:24:01] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:24:33] <wikibugs>	 SRE-tools, Discovery, Discovery-Search, Infrastructure-Foundations, IPv6: Some elastic hosts do not have IPv6 DNS records - https://phabricator.wikimedia.org/T271143 (bking) Open→Resolved
[22:25:03] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Znuny, fundraising-tech-ops: move donation,donate, donations (otrs, wikimania) exim aliases from SRE to ITS - https://phabricator.wikimedia.org/T297915 (Dzahn) In progress→Resolved
[22:25:09] <wikibugs>	 SRE, Infrastructure-Foundations, Mail, Epic: Move most (all?) exim personal aliases to WMF ITS - https://phabricator.wikimedia.org/T122144 (Dzahn)
[22:27:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[22:27:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[22:27:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:27:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T300774)', diff saved to https://phabricator.wikimedia.org/P27896 and previous config saved to /var/cache/conftool/dbconfig/20220517-222904-ladsgroup.json
[22:29:08] <wikibugs>	 (PS1) Razzi: turnilo: change an-tool1011 to use bullseye [puppet] - https://gerrit.wikimedia.org/r/792733 (https://phabricator.wikimedia.org/T308597)
[22:29:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:28] <stashbot>	 T300774: Drop fr_img_* columns - https://phabricator.wikimedia.org/T300774
[22:29:45] <wikibugs>	 (CR) Razzi: [V: +1] "Now that turnilo requires Debian 11 and superset requires Debian 10, this patch moves turnilo to a newly created dedicated turnilo staging" [puppet] - https://gerrit.wikimedia.org/r/792724 (https://phabricator.wikimedia.org/T308597) (owner: Razzi)
[22:31:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[22:31:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:31:52] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] gitlab runner: restrict docker images and services [puppet] - https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: Brennen Bearnes)
[22:31:54] <wikibugs>	 SRE, ops-codfw, DC-Ops: Q3:(Need By: TBD) rack/setup/install contint2002, gerrit2002 - https://phabricator.wikimedia.org/T299575 (Dzahn) @Papaul Yea, reimaging is no problem. It's still in "insetup" and I can do it. Pick the easier option for you.
[22:32:29] <James_F>	 urbanecm: Please follow https://wikitech.wikimedia.org/wiki/Deployments/Emergencies in future for deploys outside of the deploy windows.
[22:38:25] <wikibugs>	 (CR) Brennen Bearnes: gitlab runner: restrict docker images and services (1 comment) [puppet] - https://gerrit.wikimedia.org/r/724472 (https://phabricator.wikimedia.org/T291978) (owner: Brennen Bearnes)
[22:38:27] <wikibugs>	 (CR) Jforrester: "Will this need a change like 24a6e44a5bb3f9b30d13c9852577c3c0678bf62d too as you're switching to service-runner 3 from 2?" [deployment-charts] - https://gerrit.wikimedia.org/r/792625 (owner: PipelineBot)
[22:44:09] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:44:31] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:44:55] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:48:19] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:59:57] <icinga-wm>	 PROBLEM - SSH on wtp1046.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:01:57] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:02:07] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:05:43] <icinga-wm>	 RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:16:47] <icinga-wm>	 PROBLEM - SSH on analytics1061.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:22:46] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:28:51] <icinga-wm>	 PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: certspotter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:29:40] <wikibugs>	 (PS1) Jforrester: [shnwiki] Enable the SandboxLink extension [mediawiki-config] - https://gerrit.wikimedia.org/r/792737 (https://phabricator.wikimedia.org/T308623)
[23:53:59] <icinga-wm>	 RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state