[00:00:12] <icinga-wm>	 RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:20] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[00:47:32] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[00:50:46] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[00:53:58] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[01:40:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:43:50] <wikibugs>	 (CR) MSantos: [C: +2] wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/870535 (owner: PipelineBot)
[01:48:32] <wikibugs>	 (Merged) jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/870535 (owner: PipelineBot)
[01:50:32] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[01:50:53] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[01:51:15] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[01:52:05] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[01:52:37] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[01:53:27] <logmsgbot>	 !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[01:54:24] <mbsantos>	 !log Update verbiage of fundraising banner (T325690)
[01:54:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:28] <stashbot>	 T325690: Updated message for 2022 English fundraising in iOS app - https://phabricator.wikimedia.org/T325690
[01:55:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:15:38] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:20:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:39] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:46:44] <wikibugs>	 SRE, All-and-every-Wikisource, Product-Analytics, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (Soda) >>! In T325607#8485491, @SCherukuwada wrote: > I've mostly focused on search performance and stats for the Wikipedias and haven't had a chanc...
[03:08:20] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[03:09:56] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[03:11:26] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:26:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:35:14] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[03:38:28] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[04:33:21] <wikibugs>	 SRE, All-and-every-Wikisource, Product-Analytics, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (Seddon) Page: https://de.wikisource.org/wiki/Zedler:Puppenwerck Search: https://www.google.de/search?q=nennet+man+%C3%BCberhaupt+alles+Spielwerck...
[04:36:22] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[04:39:36] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[04:49:22] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 235 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:51:00] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:16:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:29:42] <icinga-wm>	 PROBLEM - puppet last run on maps1009 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[05:59:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:04:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:14:44] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:20:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:37:40] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:37:49] <wikibugs>	 (PS1) Marostegui: analytics-meta.my.cnf.erb: : Remove unix_socket mention [puppet] - https://gerrit.wikimedia.org/r/870744 (https://phabricator.wikimedia.org/T325154)
[06:42:44] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[06:46:00] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[06:53:43] <wikibugs>	 (CR) Marostegui: [C: +2] analytics-meta.my.cnf.erb: : Remove unix_socket mention [puppet] - https://gerrit.wikimedia.org/r/870744 (https://phabricator.wikimedia.org/T325154) (owner: Marostegui)
[07:10:32] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:14:20] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 28398
[07:14:44] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 28398
[07:18:51] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 20485
[07:19:06] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 20485
[07:20:28] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 17806
[07:21:04] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 17806
[07:21:17] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 56286
[07:22:14] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 56286
[07:43:04] <wikibugs>	 SRE, Traffic: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (Vgutierrez)
[07:43:58] <vgutierrez>	 !log restarting varnish on cp4052 to clear VarnishChildRestarted alert - T325797
[07:44:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:04] <stashbot>	 T325797: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797
[07:44:39] <wikibugs>	 SRE, Traffic-Icebox, Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (ayounsi) a:ayounsi
[07:47:41] <wikibugs>	 SRE, Traffic-Icebox, Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (ayounsi) Assigning the task to myself to remove the router's static routes after the break.
[07:48:59] <wikibugs>	 SRE, Gerrit, serviceops-collab: move gerrit.wm.org SSH service to private/behind LVS like phab-vcs - https://phabricator.wikimedia.org/T165631 (ayounsi) Noted, thanks for the explanation!
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221222T0800)
[08:00:08] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:11:35] <moritzm>	 !log installing libksba security updates
[08:11:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:58] <wikibugs>	 (PS1) Muehlenhoff: Add library hint for libksba [puppet] - https://gerrit.wikimedia.org/r/870746
[08:15:01] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (dcaro)
[08:15:41] <wikibugs>	 ops-drmrs: cr2-drmrs:xe-0/1/1 stuck optic - https://phabricator.wikimedia.org/T324555 (ayounsi)
[08:16:29] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add library hint for libksba [puppet] - https://gerrit.wikimedia.org/r/870746 (owner: Muehlenhoff)
[08:25:12] <wikibugs>	 SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (dcaro)
[08:31:14] <wikibugs>	 (PS1) Slyngshede: WIP: Access Requests [software/bitu] - https://gerrit.wikimedia.org/r/870747
[08:34:25] <wikibugs>	 (CR) David Caro: Add moved to wmcs-cookbooks message. (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/869167 (owner: David Caro)
[08:34:48] <wikibugs>	 (PS3) David Caro: Add moved to wmcs-cookbooks message. [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/869167 (https://phabricator.wikimedia.org/T319401)
[08:35:15] <wikibugs>	 (CR) CI reject: [V: -1] Add moved to wmcs-cookbooks message. [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/869167 (https://phabricator.wikimedia.org/T319401) (owner: David Caro)
[08:36:03] <wikibugs>	 (PS2) David Caro: alertmanager: format a bit nicer the default args [puppet] - https://gerrit.wikimedia.org/r/868634 (https://phabricator.wikimedia.org/T323714)
[08:36:13] <wikibugs>	 (CR) CI reject: [V: -1] alertmanager: format a bit nicer the default args [puppet] - https://gerrit.wikimedia.org/r/868634 (https://phabricator.wikimedia.org/T323714) (owner: David Caro)
[08:36:15] <wikibugs>	 (PS2) David Caro: karma: add metrcsinfra alertmanager [puppet] - https://gerrit.wikimedia.org/r/868638 (https://phabricator.wikimedia.org/T323714)
[08:36:24] <wikibugs>	 (CR) CI reject: [V: -1] karma: add metrcsinfra alertmanager [puppet] - https://gerrit.wikimedia.org/r/868638 (https://phabricator.wikimedia.org/T323714) (owner: David Caro)
[08:36:55] <wikibugs>	 (Abandoned) David Caro: Revert "cumin: add an audit report for insetup servers" [puppet] - https://gerrit.wikimedia.org/r/866699 (owner: David Caro)
[09:22:35] <wikibugs>	 (PS7) Slyngshede: Signup and LDAP flow. [software/bitu] - https://gerrit.wikimedia.org/r/860021
[09:25:53] <wikibugs>	 (PS5) Jcrespo: dbbackups: Start backing up backup1-eqiad and backup1-codfw sections [puppet] - https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582)
[09:27:09] <wikibugs>	 (CR) Slyngshede: Signup and LDAP flow. (3 comments) [software/bitu] - https://gerrit.wikimedia.org/r/860021 (owner: Slyngshede)
[09:27:44] <wikibugs>	 (CR) CI reject: [V: -1] dbbackups: Start backing up backup1-eqiad and backup1-codfw sections [puppet] - https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[09:29:23] <wikibugs>	 (PS1) Muehlenhoff: os-updates-report: Allow passing an additional owners file [puppet] - https://gerrit.wikimedia.org/r/870749
[09:34:07] <wikibugs>	 (CR) Muehlenhoff: [C: +2] os-updates-report: Allow passing an additional owners file [puppet] - https://gerrit.wikimedia.org/r/870749 (owner: Muehlenhoff)
[09:38:31] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-jumbo1013.eqiad.wmnet with OS bullseye
[09:38:39] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1013...
[09:39:47] <wikibugs>	 (CR) Ayounsi: "Some comments but overall this goes in the good direction!" [puppet] - https://gerrit.wikimedia.org/r/869224 (owner: Jbond)
[09:40:15] <wikibugs>	 (PS1) Muehlenhoff: os-reports: Add owner overrides [puppet] - https://gerrit.wikimedia.org/r/870750
[09:40:36] <wikibugs>	 (CR) CI reject: [V: -1] os-reports: Add owner overrides [puppet] - https://gerrit.wikimedia.org/r/870750 (owner: Muehlenhoff)
[09:41:23] <wikibugs>	 (PS2) Muehlenhoff: os-reports: Add owner overrides [puppet] - https://gerrit.wikimedia.org/r/870750
[09:42:29] <wikibugs>	 (PS3) Muehlenhoff: os-reports: Add owner overrides [puppet] - https://gerrit.wikimedia.org/r/870750
[09:44:16] <wikibugs>	 (CR) CI reject: [V: -1] os-reports: Add owner overrides [puppet] - https://gerrit.wikimedia.org/r/870750 (owner: Muehlenhoff)
[09:44:28] <wikibugs>	 (CR) Ayounsi: [C: +1] "LGTM but let's not deploy it before the end of year break." [homer/public] - https://gerrit.wikimedia.org/r/869736 (owner: Majavah)
[09:46:14] <wikibugs>	 (PS4) Muehlenhoff: os-reports: Add owner overrides [puppet] - https://gerrit.wikimedia.org/r/870750
[09:50:10] <wikibugs>	 (CR) Muehlenhoff: [C: +2] os-reports: Add owner overrides [puppet] - https://gerrit.wikimedia.org/r/870750 (owner: Muehlenhoff)
[09:55:58] <wikibugs>	 (PS1) Marostegui: mariadb: Clarify unix_socket entries [puppet] - https://gerrit.wikimedia.org/r/870751 (https://phabricator.wikimedia.org/T325154)
[09:58:06] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Clarify unix_socket entries [puppet] - https://gerrit.wikimedia.org/r/870751 (https://phabricator.wikimedia.org/T325154) (owner: Marostegui)
[10:06:48] <wikibugs>	 (PS1) Muehlenhoff: Add a license statement to puppet.git with an overview [puppet] - https://gerrit.wikimedia.org/r/870813
[10:09:27] <wikibugs>	 (PS6) Jcrespo: dbbackups: Start backing up backup1-eqiad and backup1-codfw sections [puppet] - https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582)
[10:10:16] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:20:23] <wikibugs>	 (CR) RhinosF1: Add a license statement to puppet.git with an overview (1 comment) [puppet] - https://gerrit.wikimedia.org/r/870813 (owner: Muehlenhoff)
[10:20:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:22:42] <wikibugs>	 (CR) Gehel: [C: -1] "A few open questions inline. Review by someone more familiar with our dumps infrastructure would be welcomed." [puppet] - https://gerrit.wikimedia.org/r/870714 (https://phabricator.wikimedia.org/T222349) (owner: Bking)
[10:23:57] <wikibugs>	 (PS2) Muehlenhoff: Add a license statement to puppet.git with an overview [puppet] - https://gerrit.wikimedia.org/r/870813
[10:24:04] <wikibugs>	 (CR) Muehlenhoff: Add a license statement to puppet.git with an overview (1 comment) [puppet] - https://gerrit.wikimedia.org/r/870813 (owner: Muehlenhoff)
[10:24:53] <wikibugs>	 (PS8) Slyngshede: Signup and LDAP flow. [software/bitu] - https://gerrit.wikimedia.org/r/860021
[10:24:55] <wikibugs>	 (PS2) Slyngshede: WIP: Access Requests [software/bitu] - https://gerrit.wikimedia.org/r/870747
[10:24:59] <wikibugs>	 (CR) RhinosF1: Add a license statement to puppet.git with an overview (1 comment) [puppet] - https://gerrit.wikimedia.org/r/870813 (owner: Muehlenhoff)
[10:34:41] <wikibugs>	 SRE, All-and-every-Wikisource, Product-Analytics, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (Darwinius) @Seddon Was probably indexed in the last couple of days, most probably related to it appearing on this thread, since many pages created...
[10:35:46] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Juniper QFC5120 error logs on lsw1-e1 and lsw1-f1: Failed to get ifl for ifl index - https://phabricator.wikimedia.org/T325801 (cmooney) p:Triage→Low
[10:37:39] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:37:58] <wikibugs>	 (PS1) Btullis: Disable the presto server on the 10 new hosts [puppet] - https://gerrit.wikimedia.org/r/870819 (https://phabricator.wikimedia.org/T323783)
[10:38:25] <wikibugs>	 (CR) CI reject: [V: -1] Disable the presto server on the 10 new hosts [puppet] - https://gerrit.wikimedia.org/r/870819 (https://phabricator.wikimedia.org/T323783) (owner: Btullis)
[10:40:18] <wikibugs>	 (PS2) Btullis: Disable the presto server on the 10 new hosts [puppet] - https://gerrit.wikimedia.org/r/870819 (https://phabricator.wikimedia.org/T323783)
[10:42:16] <wikibugs>	 (CR) CI reject: [V: -1] Disable the presto server on the 10 new hosts [puppet] - https://gerrit.wikimedia.org/r/870819 (https://phabricator.wikimedia.org/T323783) (owner: Btullis)
[10:42:20] <wikibugs>	 (CR) ArielGlenn: query_service: Allow query hosts to rsync data from clouddumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/870714 (https://phabricator.wikimedia.org/T222349) (owner: Bking)
[10:45:05] <wikibugs>	 (PS3) Btullis: Disable the presto server on the 10 new hosts [puppet] - https://gerrit.wikimedia.org/r/870819 (https://phabricator.wikimedia.org/T323783)
[10:50:38] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38921/console" [puppet] - https://gerrit.wikimedia.org/r/870819 (https://phabricator.wikimedia.org/T323783) (owner: Btullis)
[10:54:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:54:12] <wikibugs>	 ops-eqiad, Infrastructure-Foundations, netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (ayounsi) p:Triage→High
[10:54:50] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1013.eqiad.wmnet with reason: host reimage
[10:55:21] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Juniper QFX5120 error logs on lsw1-e1 and lsw1-f1: Failed to get ifl for ifl index - https://phabricator.wikimedia.org/T325801 (ayounsi)
[10:57:01] <wikibugs>	 (CR) Muehlenhoff: "Code looks good, a few remaining typos and proposed text changes" [software/bitu] - https://gerrit.wikimedia.org/r/860021 (owner: Slyngshede)
[10:57:25] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Disable the presto server on the 10 new hosts [puppet] - https://gerrit.wikimedia.org/r/870819 (https://phabricator.wikimedia.org/T323783) (owner: Btullis)
[10:57:40] <wikibugs>	 (PS1) JMeybohm: k8s: Add the ClusterIP of kubernetes.default.cluster.local to cert [puppet] - https://gerrit.wikimedia.org/r/870820 (https://phabricator.wikimedia.org/T307943)
[10:57:56] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1013.eqiad.wmnet with reason: host reimage
[11:01:26] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (MoritzMuehlenhoff)
[11:05:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:08:45] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (MoritzMuehlenhoff)
[11:09:57] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[11:11:24] <wikibugs>	 ops-eqiad, Infrastructure-Foundations, netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (cmooney) For the record I had a quick look at the codfw / ulsfo / eqsin / esams virtual-chassis port stats and none of them are showing historical CRC errors.
[11:11:45] <wikibugs>	 (CR) Elukey: [C: +1] "LGTM! Let's run puppet just to make sure it works as intended (if it is only for 1.23 no diff will pop up probably, but a check is always " [puppet] - https://gerrit.wikimedia.org/r/870820 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[11:12:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:13:10] <wikibugs>	 SRE, serviceops: k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700 (ayounsi) I opened {T325806} to get the dashboard back online, I'll re-open it if there is any need. Thanks!
[11:14:56] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 17806
[11:16:16] <wikibugs>	 (PS1) Muehlenhoff: netbox: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/870822
[11:17:12] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:17:22] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/870822 (owner: Muehlenhoff)
[11:23:21] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 17806
[11:26:34] <wikibugs>	 (Abandoned) Muehlenhoff: Disable LDAP auth in debmonitor [puppet] - https://gerrit.wikimedia.org/r/661078 (owner: Muehlenhoff)
[11:26:57] <wikibugs>	 (Abandoned) Muehlenhoff: Reenable U2F for now [puppet] - https://gerrit.wikimedia.org/r/805836 (owner: Muehlenhoff)
[11:28:03] <wikibugs>	 (Abandoned) Muehlenhoff: Enable OIDC in Gradle build [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/810867 (https://phabricator.wikimedia.org/T311999) (owner: Muehlenhoff)
[11:28:44] <wikibugs>	 (PS3) Muehlenhoff: Enable profile::auto_restarts::service for kpropd [puppet] - https://gerrit.wikimedia.org/r/775318 (https://phabricator.wikimedia.org/T135991)
[11:30:28] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[11:30:32] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1013.eqiad.wmnet with OS bullseye
[11:30:39] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1013.eqi...
[11:32:29] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-jumbo1014.eqiad.wmnet with OS bullseye
[11:32:36] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1014...
[11:34:43] <wikibugs>	 (PS1) Muehlenhoff: Fix up comment wrt use of restrict for WMCS [puppet] - https://gerrit.wikimedia.org/r/870824
[11:35:19] <wikibugs>	 (Abandoned) Muehlenhoff: cumin: Switch SSH key config to "restrict" [puppet] - https://gerrit.wikimedia.org/r/675083 (owner: Muehlenhoff)
[11:36:52] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:44:58] <wikibugs>	 (PS9) Slyngshede: Signup and LDAP flow. [software/bitu] - https://gerrit.wikimedia.org/r/860021
[11:45:05] <wikibugs>	 (CR) Slyngshede: Signup and LDAP flow. (3 comments) [software/bitu] - https://gerrit.wikimedia.org/r/860021 (owner: Slyngshede)
[11:52:25] <wikibugs>	 (CR) Muehlenhoff: "One final typo :-)" [software/bitu] - https://gerrit.wikimedia.org/r/860021 (owner: Slyngshede)
[11:53:59] <wikibugs>	 (PS10) Slyngshede: Signup and LDAP flow. [software/bitu] - https://gerrit.wikimedia.org/r/860021
[11:54:09] <wikibugs>	 (CR) Slyngshede: Signup and LDAP flow. (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/860021 (owner: Slyngshede)
[11:56:28] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Ship it :-)" [software/bitu] - https://gerrit.wikimedia.org/r/860021 (owner: Slyngshede)
[11:58:14] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[12:01:32] <wikibugs>	 (PS1) Muehlenhoff: Java: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870846
[12:01:51] <wikibugs>	 (CR) CI reject: [V: -1] Java: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870846 (owner: Muehlenhoff)
[12:04:21] <wikibugs>	 (CR) Arturo Borrero Gonzalez: kubeadm: psp: base-pod-security-policies.yaml: allow hostPath volumes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/870686 (https://phabricator.wikimedia.org/T325755) (owner: Arturo Borrero Gonzalez)
[12:05:55] <wikibugs>	 (PS1) Muehlenhoff: cassandra: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870847
[12:07:00] <wikibugs>	 (PS2) Muehlenhoff: Java: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870846
[12:07:25] <wikibugs>	 (CR) CI reject: [V: -1] Java: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870846 (owner: Muehlenhoff)
[12:07:31] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/870847 (owner: Muehlenhoff)
[12:11:47] <wikibugs>	 (PS3) Muehlenhoff: Java: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870846
[12:12:15] <wikibugs>	 (CR) CI reject: [V: -1] Java: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870846 (owner: Muehlenhoff)
[12:18:37] <wikibugs>	 (CR) Jaime Nuche: [C: +1] admin: create new group deployment-jenkins (1 comment) [puppet] - https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn)
[12:18:59] <wikibugs>	 (CR) Jaime Nuche: [C: +1] deployment_server: add keyholder/group config for jenkins-ci deploy [puppet] - https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn)
[12:28:17] <wikibugs>	 (PS1) Btullis: Add a partman recipe for the cephosd servers [puppet] - https://gerrit.wikimedia.org/r/870861 (https://phabricator.wikimedia.org/T324670)
[12:28:42] <wikibugs>	 (CR) CI reject: [V: -1] Add a partman recipe for the cephosd servers [puppet] - https://gerrit.wikimedia.org/r/870861 (https://phabricator.wikimedia.org/T324670) (owner: Btullis)
[12:30:21] <wikibugs>	 (PS2) Btullis: Add a partman recipe for the cephosd servers [puppet] - https://gerrit.wikimedia.org/r/870861 (https://phabricator.wikimedia.org/T324670)
[12:34:16] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1014.eqiad.wmnet with reason: host reimage
[12:37:18] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1014.eqiad.wmnet with reason: host reimage
[12:49:45] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[12:50:15] <wikibugs>	 (CR) Slyngshede: [V: +2] Signup and LDAP flow. [software/bitu] - https://gerrit.wikimedia.org/r/860021 (owner: Slyngshede)
[12:50:18] <wikibugs>	 (CR) Slyngshede: [V: +2 C: +2] Signup and LDAP flow. [software/bitu] - https://gerrit.wikimedia.org/r/860021 (owner: Slyngshede)
[12:50:59] <wikibugs>	 (PS3) Slyngshede: WIP: Access Requests [software/bitu] - https://gerrit.wikimedia.org/r/870747
[13:08:53] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (ayounsi) I had a closer look as there is support for this kind of graphing and alerting in LibreNMS since a while https://github.com/librenms/librenms/blame/258505ed4429050344...
[13:14:34] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:16:54] <wikibugs>	 SRE, ops-ulsfo: ripe-atlas-ulsfo down - https://phabricator.wikimedia.org/T325549 (ayounsi) Thanks. Regardless of VM or physical you can go ahead with decommissioning it.
[13:18:35] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[13:18:40] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1014.eqiad.wmnet with OS bullseye
[13:18:47] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1014.eqi...
[13:46:50] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-jumbo1015.eqiad.wmnet with OS bullseye
[13:46:59] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1015...
[13:49:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[13:49:14] <wikibugs>	 (PS2) Raymond Ndibe: tools-webservice: read DEFAULT_BUILD_SERVICE_REGISTRY from config [software/tools-webservice] - https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689)
[13:49:50] <wikibugs>	 (CR) Raymond Ndibe: tools-webservice: read DEFAULT_BUILD_SERVICE_REGISTRY from config (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: Raymond Ndibe)
[14:10:37] <wikibugs>	 (CR) David Caro: [V: +1 C: +2] Add moved to wmcs-cookbooks message. [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/869167 (https://phabricator.wikimedia.org/T319401) (owner: David Caro)
[14:10:59] <wikibugs>	 (CR) CI reject: [V: -1] Add moved to wmcs-cookbooks message. [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/869167 (https://phabricator.wikimedia.org/T319401) (owner: David Caro)
[14:11:00] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:12:13] <wikibugs>	 (CR) FNegri: [C: +1] Add moved to wmcs-cookbooks message. (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/869167 (https://phabricator.wikimedia.org/T319401) (owner: David Caro)
[14:13:16] <wikibugs>	 (PS4) David Caro: Add moved to wmcs-cookbooks message. [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/869167
[14:16:03] <wikibugs>	 (CR) Jbond: "not tested but lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[14:16:14] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[14:23:16] <wikibugs>	 (CR) David Caro: [C: +2] Add moved to wmcs-cookbooks message. [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/869167 (owner: David Caro)
[14:23:18] <wikibugs>	 (CR) Elukey: [C: +2] sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[14:23:44] <wikibugs>	 (Merged) jenkins-bot: Add moved to wmcs-cookbooks message. [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/869167 (owner: David Caro)
[14:27:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[14:37:39] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:42:59] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-jumbo1015.eqiad.wmnet with OS bullseye
[14:43:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1015.eqi...
[14:43:55] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-jumbo1015.eqiad.wmnet with OS bullseye
[14:44:03] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1015...
[14:47:23] <akosiaris>	 !log truncate daemon.log.1 on maps1009 to free up disk space
[14:47:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:18] <icinga-wm>	 RECOVERY - Disk space on maps1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=maps1009&var-datasource=eqiad+prometheus/ops
[14:59:48] <icinga-wm>	 RECOVERY - puppet last run on maps1009 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[15:12:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:17:04] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/870824 (owner: Muehlenhoff)
[15:30:02] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/870813 (owner: Muehlenhoff)
[15:37:28] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (Jclark-ctr) @ayounsi   I do have spare optics for connection.  1/5/23  is a good day to perform this maintenance
[15:38:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:40:14] <logmsgbot>	 !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-jumbo1015.eqiad.wmnet with OS bullseye
[15:40:21] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1015.eqi...
[15:42:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[15:42:23] <wikibugs>	 (PS1) AikoChou: ml-services: update revertrisk docker image [deployment-charts] - https://gerrit.wikimedia.org/r/870900 (https://phabricator.wikimedia.org/T325349)
[15:43:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:45:38] <wikibugs>	 (PS1) JHathaway: rspamd: vendor github.com/oxc/puppet-rspamd [puppet] - https://gerrit.wikimedia.org/r/870901 (https://phabricator.wikimedia.org/T325397)
[15:47:25] <wikibugs>	 (PS2) JHathaway: rspamd: vendor github.com/oxc/puppet-rspamd [puppet] - https://gerrit.wikimedia.org/r/870901 (https://phabricator.wikimedia.org/T325397)
[15:48:21] <wikibugs>	 (CR) JHathaway: "kindly review!" [puppet] - https://gerrit.wikimedia.org/r/870901 (https://phabricator.wikimedia.org/T325397) (owner: JHathaway)
[15:53:54] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +1] ml-services: update revertrisk docker image [deployment-charts] - https://gerrit.wikimedia.org/r/870900 (https://phabricator.wikimedia.org/T325349) (owner: AikoChou)
[15:54:35] <wikibugs>	 (CR) Elukey: [C: +2] ml-services: update revertrisk docker image [deployment-charts] - https://gerrit.wikimedia.org/r/870900 (https://phabricator.wikimedia.org/T325349) (owner: AikoChou)
[15:56:40] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:57:00] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:57:28] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:57:50] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add the cache.mcrouter module [deployment-charts] - https://gerrit.wikimedia.org/r/870903
[15:58:13] <wikibugs>	 (PS1) Ayounsi: BGP for NTT in drmrs [homer/public] - https://gerrit.wikimedia.org/r/870904 (https://phabricator.wikimedia.org/T314929)
[15:58:14] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[15:58:52] <wikibugs>	 (PS1) Alexandros Kosiaris: imposm: Ship rsyslog/lograte rules and make it quiet [puppet] - https://gerrit.wikimedia.org/r/870905
[15:59:16] <wikibugs>	 (CR) CI reject: [V: -1] imposm: Ship rsyslog/lograte rules and make it quiet [puppet] - https://gerrit.wikimedia.org/r/870905 (owner: Alexandros Kosiaris)
[16:00:22] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[16:01:05] <wikibugs>	 (PS2) Alexandros Kosiaris: imposm: Ship rsyslog/lograte rules and make it quiet [puppet] - https://gerrit.wikimedia.org/r/870905
[16:02:04] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (cmooney) >>! In T325803#8486630, @ayounsi wrote: > For example `jnxVirtualChassisPortInCRCAlignErrors.5."vcp-255/1/3" = 42` while it has been cleared and should now be at 0....
[16:02:31] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: +1] Specify Citoid RESTBase URL separately [mediawiki-config] - https://gerrit.wikimedia.org/r/869226 (https://phabricator.wikimedia.org/T325425) (owner: Bartosz Dziewoński)
[16:03:05] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] imposm: Ship rsyslog/lograte rules and make it quiet [puppet] - https://gerrit.wikimedia.org/r/870905 (owner: Alexandros Kosiaris)
[16:03:54] <wikibugs>	 (PS18) Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[16:03:56] <wikibugs>	 (PS1) Elukey: sre.discovery.service-route: fix bugs [cookbooks] - https://gerrit.wikimedia.org/r/870926 (https://phabricator.wikimedia.org/T277677)
[16:04:20] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "Looks good, reasons why as outlined on the task make sense to me." [homer/public] - https://gerrit.wikimedia.org/r/867128 (https://phabricator.wikimedia.org/T324955) (owner: Ayounsi)
[16:05:05] <wikibugs>	 (PS3) Alexandros Kosiaris: imposm: Ship rsyslog/lograte rules and make it quiet [puppet] - https://gerrit.wikimedia.org/r/870905
[16:07:17] <wikibugs>	 (PS4) Alexandros Kosiaris: imposm: Ship rsyslog/lograte rules and make it quiet [puppet] - https://gerrit.wikimedia.org/r/870905
[16:07:44] <wikibugs>	 (CR) Elukey: [C: +2] sre.discovery.service-route: fix bugs [cookbooks] - https://gerrit.wikimedia.org/r/870926 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[16:09:44] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.discovery.service-route check inference: maintenance
[16:09:44] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) check inference: maintenance
[16:09:55] <wikibugs>	 (PS5) Alexandros Kosiaris: imposm: Ship rsyslog/lograte rules and make it quiet [puppet] - https://gerrit.wikimedia.org/r/870905
[16:10:18] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.discovery.service-route depool inference in codfw: maintenance
[16:11:21] <elukey>	 testing --^
[16:11:22] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38926/console" [puppet] - https://gerrit.wikimedia.org/r/870905 (owner: Alexandros Kosiaris)
[16:14:28] <wikibugs>	 SRE, ops-eqiad, Infrastructure-Foundations, netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (ayounsi) > jnxVirtualChassisPortInCRCAlignErrors is a COUNTER64, so I'm not sure that clearing the device counters should reset what SNMP reports. If it did then LibreNMS woul...
[16:15:21] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool inference in codfw: maintenance
[16:16:11] <wikibugs>	 (PS6) Alexandros Kosiaris: imposm: Ship rsyslog/lograte rules and make it quiet [puppet] - https://gerrit.wikimedia.org/r/870905
[16:17:40] <wikibugs>	 (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38927/console" [puppet] - https://gerrit.wikimedia.org/r/870905 (owner: Alexandros Kosiaris)
[16:17:40] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.discovery.service-route depool inference in eqiad: maintenance
[16:17:41] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) depool inference in eqiad: maintenance
[16:17:53] <elukey>	 perfect
[16:18:05] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.discovery.service-route pool inference in codfw: maintenance
[16:19:37] <wikibugs>	 (PS7) Alexandros Kosiaris: imposm: Ship rsyslog/lograte rules and make it quiet [puppet] - https://gerrit.wikimedia.org/r/870905
[16:19:41] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] imposm: Ship rsyslog/lograte rules and make it quiet [puppet] - https://gerrit.wikimedia.org/r/870905 (owner: Alexandros Kosiaris)
[16:23:08] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool inference in codfw: maintenance
[16:25:41] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, and 3 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (elukey) Current status:  * `sre.discovery.service-route` (used by `sre.k8s.pool-depool-cluster`) has been moved to the...
[16:29:42] <wikibugs>	 (CR) Jaime Nuche: [C: +1] scap: disable git safe.directory [puppet] - https://gerrit.wikimedia.org/r/868002 (https://phabricator.wikimedia.org/T325128) (owner: Hashar)
[16:40:59] <wikibugs>	 SRE, LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (BCornwall) There seems to be a little confusion on what's needed here. Talking with @Nikerabbit on IRC confirmed that WMF-NDA ticket access is not necessary, only logstash access. As per [[ htt...
[16:42:14] <wikibugs>	 SRE, LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (BCornwall) p:Triage→High Marking as high priority since this has taken much longer than necessary: Let's get this done.
[16:44:59] <wikibugs>	 (CR) Btullis: [C: +2] Add a partman recipe for the cephosd servers [puppet] - https://gerrit.wikimedia.org/r/870861 (https://phabricator.wikimedia.org/T324670) (owner: Btullis)
[16:50:15] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[16:51:09] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[16:51:36] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-jumbo1015.eqiad.wmnet with OS bullseye
[16:51:43] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1015...
[16:53:26] <wikibugs>	 (CR) Alexandros Kosiaris: query_service: Allow query hosts to rsync data from clouddumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/870714 (https://phabricator.wikimedia.org/T222349) (owner: Bking)
[16:56:19] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host cephosd1001.eqiad.wmnet with OS bullseye
[16:57:54] <wikibugs>	 (CR) Eevans: [C: +1] "Puppet compiler running against restbase-dev1006 is interesting, where did it pick that up from?" [puppet] - https://gerrit.wikimedia.org/r/870847 (owner: Muehlenhoff)
[16:59:41] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Juniper QFX5120 error logs on lsw1-e1 and lsw1-f1: Failed to get ifl for ifl index - https://phabricator.wikimedia.org/T325801 (cmooney) Juniper have come back to say the message is harmless and can be ignored.  > It’s a Harmless error message. >  > I gone thr...
[17:20:08] <wikibugs>	 ops-ulsfo, decommission-hardware: decommission atlas-ulsfo - https://phabricator.wikimedia.org/T325824 (RobH) p:Triage→Low
[17:21:33] <wikibugs>	 ops-ulsfo, decommission-hardware: decommission atlas-ulsfo - https://phabricator.wikimedia.org/T325824 (RobH)
[17:22:16] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[17:22:27] <robh>	 removing broken atlas from things
[17:24:58] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: atlas ulsfo decom - robh@cumin2002"
[17:25:47] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: atlas ulsfo decom - robh@cumin2002"
[17:25:47] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:26:26] <wikibugs>	 (PS1) Btullis: Correct the filename of the partman recipe for cephosd [puppet] - https://gerrit.wikimedia.org/r/870955 (https://phabricator.wikimedia.org/T324670)
[17:29:43] <wikibugs>	 (CR) Btullis: [C: +2] Correct the filename of the partman recipe for cephosd [puppet] - https://gerrit.wikimedia.org/r/870955 (https://phabricator.wikimedia.org/T324670) (owner: Btullis)
[17:35:28] <wikibugs>	 SRE, ops-ulsfo: ripe-atlas-ulsfo down - https://phabricator.wikimedia.org/T325549 (RobH) Open→Resolved
[17:36:49] <wikibugs>	 SRE, ops-ulsfo, decommission-hardware: decommission atlas-ulsfo - https://phabricator.wikimedia.org/T325824 (RobH) a:RobH→ayounsi Arzhel,  Does anything need to be done in the RIPE portal now that this atlas is defunct?  I've removed its dns entries and disabled its switch port.  Once the rip...
[17:37:19] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (EddieGP)
[18:00:36] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1015.eqiad.wmnet with reason: host reimage
[18:03:35] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1015.eqiad.wmnet with reason: host reimage
[18:04:13] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:08:23] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for Wangombe - https://phabricator.wikimedia.org/T325828 (BCornwall)
[18:08:43] <wikibugs>	 SRE, LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (BCornwall) I've been corrected: [[ https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMF_Group | Documentation ]] shows that contractors should also be added to the WMF-NDA gr...
[18:09:25] <wikibugs>	 SRE, LDAP-Access-Requests: WMF-NDA access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (BCornwall) Open→Resolved
[18:09:45] <wikibugs>	 SRE, LDAP-Access-Requests: WMF-NDA access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (BCornwall)
[18:09:47] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for Wangombe - https://phabricator.wikimedia.org/T325828 (BCornwall) p:Triage→High Moving to high priority since this has taken a long time.
[18:12:28] <wikibugs>	 (CR) Jcrespo: [C: +2] dbbackups: Start backing up backup1-eqiad and backup1-codfw sections [puppet] - https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[18:16:04] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[18:26:05] <wikibugs>	 (PS1) Jcrespo: mediabackups: Migrate mediabackups database service to backup1 sections [puppet] - https://gerrit.wikimedia.org/r/870964 (https://phabricator.wikimedia.org/T313582)
[18:26:56] <wikibugs>	 (CR) Jcrespo: "FYI" [puppet] - https://gerrit.wikimedia.org/r/870964 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[18:27:11] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[18:27:16] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1015.eqiad.wmnet with OS bullseye
[18:27:22] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1015.eqi...
[18:27:35] <wikibugs>	 (CR) RLazarus: [C: +2] Add an option, off by default, to retry once when a request times out. [software/httpbb] - https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)
[18:28:13] <wikibugs>	 (CR) Jcrespo: [C: +2] mediabackups: Migrate mediabackups database service to backup1 sections [puppet] - https://gerrit.wikimedia.org/r/870964 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[18:29:13] <wikibugs>	 (Merged) jenkins-bot: Add an option, off by default, to retry once when a request times out. [software/httpbb] - https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)
[18:32:03] <wikibugs>	 (PS2) Jcrespo: mariadb: Reenable notifications on backup1 mariadb instances [puppet] - https://gerrit.wikimedia.org/r/868688 (https://phabricator.wikimedia.org/T313582)
[18:32:53] <wikibugs>	 (CR) Jcrespo: [C: +2] mariadb: Reenable notifications on backup1 mariadb instances [puppet] - https://gerrit.wikimedia.org/r/868688 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[18:37:40] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:39:22] <wikibugs>	 (PS1) Jcrespo: mediabackups: Disable notifications of dbs db1176 & db2151 [puppet] - https://gerrit.wikimedia.org/r/870967 (https://phabricator.wikimedia.org/T313582)
[18:40:07] <wikibugs>	 (PS2) Jcrespo: mediabackups: Disable notifications of dbs db1176 & db2151 [puppet] - https://gerrit.wikimedia.org/r/870967 (https://phabricator.wikimedia.org/T313582)
[18:44:09] <wikibugs>	 (CR) Jcrespo: [C: +2] "Another FYI. I am close to return the borrowed host back to you, but I want to keep both in parallel for final checks and making sure they" [puppet] - https://gerrit.wikimedia.org/r/870967 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[18:52:43] <wikibugs>	 (PS2) Matthias Mullie: [SearchVue] Enable on ruwiki (beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/849025 (https://phabricator.wikimedia.org/T311667)
[18:55:08] <wikibugs>	 (PS3) Matthias Mullie: [SearchVue] Enable on ruwiki (beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/849025 (https://phabricator.wikimedia.org/T311667)
[19:02:42] <wikibugs>	 SRE, ops-ulsfo, decommission-hardware: decommission atlas-ulsfo - https://phabricator.wikimedia.org/T325824 (ayounsi) As the box is dead anyway, no need to block on any RIPE portal action. You can proceed with it next time you're onsite.
[19:12:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:14:42] <wikibugs>	 (PS1) Jcrespo: mariadb: Decommission db1176 and db2151 to spares; remove mediabackupstemp [puppet] - https://gerrit.wikimedia.org/r/870970 (https://phabricator.wikimedia.org/T313582)
[19:15:02] <wikibugs>	 (CR) CI reject: [V: -1] mariadb: Decommission db1176 and db2151 to spares; remove mediabackupstemp [puppet] - https://gerrit.wikimedia.org/r/870970 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[19:16:33] <wikibugs>	 (PS2) Jcrespo: mariadb: Decommission db1176 & db2151 to spare; remove mediabackupstemp [puppet] - https://gerrit.wikimedia.org/r/870970 (https://phabricator.wikimedia.org/T313582)
[19:19:34] <wikibugs>	 (CR) Jcrespo: [C: -2] "Not ready yet (requires verification of new hosts) but please start reviewing as this is not a trivial decommission. Not urgent, though." [puppet] - https://gerrit.wikimedia.org/r/870970 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[19:21:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:26:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:58:14] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[20:09:50] <wikibugs>	 (PS1) Xcollazo: Add a systemd timer to clean up old data related to image_suggestions [puppet] - https://gerrit.wikimedia.org/r/870974 (https://phabricator.wikimedia.org/T323614)
[20:11:27] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[20:14:09] <wikibugs>	 (PS1) RobH: adding pdus for racks f[1-4] [puppet] - https://gerrit.wikimedia.org/r/870975 (https://phabricator.wikimedia.org/T290899)
[20:14:45] <wikibugs>	 (CR) RobH: [C: +2] adding pdus for racks f[1-4] [puppet] - https://gerrit.wikimedia.org/r/870975 (https://phabricator.wikimedia.org/T290899) (owner: RobH)
[20:47:22] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[20:48:58] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[20:49:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:04:22] <wikibugs>	 (PS1) JHathaway: vrts: don't wrap hiera lookup in Sensitive type [puppet] - https://gerrit.wikimedia.org/r/870977
[21:04:37] <wikibugs>	 (PS1) Stang: Revert "trwiki: Add 20 years celebration logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/870920
[21:04:58] <wikibugs>	 (PS2) Stang: Revert "trwiki: Add 20 years celebration logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/870920 (https://phabricator.wikimedia.org/T325823)
[21:05:11] <wikibugs>	 (PS3) Stang: Revert "trwiki: Add 20 years celebration logos" [mediawiki-config] - https://gerrit.wikimedia.org/r/870920 (https://phabricator.wikimedia.org/T325823)
[21:10:30] <wikibugs>	 (PS1) Stang: plwiki: Add editcontentmodel to interface-admin [mediawiki-config] - https://gerrit.wikimedia.org/r/870978 (https://phabricator.wikimedia.org/T325819)
[21:15:01] <wikibugs>	 (Abandoned) JHathaway: vrts: don't wrap hiera lookup in Sensitive type [puppet] - https://gerrit.wikimedia.org/r/870977 (owner: JHathaway)
[21:43:05] <wikibugs>	 (PS1) Stang: kuwiki: Install SandboxLink [mediawiki-config] - https://gerrit.wikimedia.org/r/870988 (https://phabricator.wikimedia.org/T325469)
[21:44:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[22:23:16] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:37:40] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:54:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:59:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:12:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:16:53] <wikibugs>	 (PS1) BCornwall: check_user.py: Fix GSuite misspelling [puppet] - https://gerrit.wikimedia.org/r/870994
[23:29:09] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for Wangombe - https://phabricator.wikimedia.org/T325828 (BCornwall) Hi, @Wangombe, what is your first/last name? I'll need that for the CR that I'll need to create. Thanks!
[23:41:52] <wikibugs>	 SRE, SRE Observability (FY2022/2023-Q2): WMF SRE holiday paging tracker 2022 - https://phabricator.wikimedia.org/T325856 (lmata)
[23:43:54] <wikibugs>	 SRE, SRE Observability (FY2022/2023-Q2): WMF SRE holiday paging tracker 2022 - https://phabricator.wikimedia.org/T325856 (lmata) Business on-call  {F35889023} {F35889025}
[23:45:12] <wikibugs>	 (CR) RLazarus: [C: +1] check_user.py: Fix GSuite misspelling [puppet] - https://gerrit.wikimedia.org/r/870994 (owner: BCornwall)
[23:45:46] <wikibugs>	 SRE, SRE Observability (FY2022/2023-Q2): WMF SRE holiday paging tracker 2022 - https://phabricator.wikimedia.org/T325856 (lmata) {F35889027}
[23:46:48] <wikibugs>	 SRE, SRE Observability (FY2022/2023-Q2): WMF SRE holiday paging tracker 2022 - https://phabricator.wikimedia.org/T325856 (lmata) {F35889030}
[23:49:33] <wikibugs>	 SRE, SRE Observability (FY2022/2023-Q2): WMF SRE holiday paging tracker 2022 - https://phabricator.wikimedia.org/T325856 (lmata) This screen shows current settings. Removed rotations and step 1 and escalates immediately to batphone.  {F35889032}
[23:50:32] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:50:50] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:55:49] <wikibugs>	 SRE, SRE Observability (FY2022/2023-Q2): WMF SRE holiday paging tracker 2022 - https://phabricator.wikimedia.org/T325856 (lmata) Open→Stalled Stalling until Jan 3rd
[23:58:14] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert