[00:00:12] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:20] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[00:47:32] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[00:50:46] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[00:53:58] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[01:40:45] (JobUnavailable) firing: (9) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:43:50] (CR) MSantos: [C: +2] wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/870535 (owner: PipelineBot)
[01:48:32] (Merged) jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/870535 (owner: PipelineBot)
[01:50:32] !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[01:50:53] !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[01:51:15] !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[01:52:05] !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[01:52:37] !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[01:53:27] !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[01:54:24] !log Update verbiage of fundraising banner (T325690)
[01:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:54:28] T325690: Updated message for 2022 English fundraising in iOS app - https://phabricator.wikimedia.org/T325690
[01:55:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets -
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:15:38] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:20:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:39] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[02:46:44] SRE, All-and-every-Wikisource, Product-Analytics, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (Soda) >>! In T325607#8485491, @SCherukuwada wrote: > I've mostly focused on search performance and stats for the Wikipedias and haven't had a chanc...
[03:08:20] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[03:09:56] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[03:11:26] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:26:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:35:14] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[03:38:28] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[04:33:21] SRE, All-and-every-Wikisource, Product-Analytics, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (Seddon) Page: https://de.wikisource.org/wiki/Zedler:Puppenwerck Search: https://www.google.de/search?q=nennet+man+%C3%BCberhaupt+alles+Spielwerck...
[04:36:22] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[04:39:36] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[04:49:22] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 235 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:51:00] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:16:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:29:42] PROBLEM - puppet last run on maps1009 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[05:59:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:04:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:14:44] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:37:40] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 -
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:37:49] (PS1) Marostegui: analytics-meta.my.cnf.erb: : Remove unix_socket mention [puppet] - https://gerrit.wikimedia.org/r/870744 (https://phabricator.wikimedia.org/T325154)
[06:42:44] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[06:46:00] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[06:53:43] (CR) Marostegui: [C: +2] analytics-meta.my.cnf.erb: : Remove unix_socket mention [puppet] - https://gerrit.wikimedia.org/r/870744 (https://phabricator.wikimedia.org/T325154) (owner: Marostegui)
[07:10:32] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:14:20] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 28398
[07:14:44] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 28398
[07:18:51] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 20485
[07:19:06] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 20485
[07:20:28] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 17806
[07:21:04] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 17806
[07:21:17] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 56286
[07:22:14] !log
ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 56286
[07:43:04] SRE, Traffic: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797 (Vgutierrez)
[07:43:58] !log restarting varnish on cp4052 to clear VarnishChildRestarted alert - T325797
[07:44:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:04] T325797: oom killed varnish on cp4052 - https://phabricator.wikimedia.org/T325797
[07:44:39] SRE, Traffic-Icebox, Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (ayounsi) a:ayounsi
[07:47:41] SRE, Traffic-Icebox, Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (ayounsi) Assigning the task to myself to remove the router's static routes after the break.
[07:48:59] SRE, Gerrit, serviceops-collab: move gerrit.wm.org SSH service to private/behind LVS like phab-vcs - https://phabricator.wikimedia.org/T165631 (ayounsi) Noted, thanks for the explanation!
[08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221222T0800)
[08:00:08] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:11:35] !log installing libksba security updates
[08:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:58] (PS1) Muehlenhoff: Add library hint for libksba [puppet] - https://gerrit.wikimedia.org/r/870746
[08:15:01] SRE-tools, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (dcaro)
[08:15:41] ops-drmrs: cr2-drmrs:xe-0/1/1 stuck optic - https://phabricator.wikimedia.org/T324555 (ayounsi)
[08:16:29] (CR) Muehlenhoff: [C: +2] Add library hint for libksba [puppet] - https://gerrit.wikimedia.org/r/870746 (owner: Muehlenhoff)
[08:25:12] SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (dcaro)
[08:31:14] (PS1) Slyngshede: WIP: Access Requests [software/bitu] - https://gerrit.wikimedia.org/r/870747
[08:34:25] (CR) David Caro: Add moved to wmcs-cookbooks message. (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/869167 (owner: David Caro)
[08:34:48] (PS3) David Caro: Add moved to wmcs-cookbooks message. [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/869167 (https://phabricator.wikimedia.org/T319401)
[08:35:15] (CR) CI reject: [V: -1] Add moved to wmcs-cookbooks message.
[cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/869167 (https://phabricator.wikimedia.org/T319401) (owner: David Caro)
[08:36:03] (PS2) David Caro: alertmanager: format a bit nicer the default args [puppet] - https://gerrit.wikimedia.org/r/868634 (https://phabricator.wikimedia.org/T323714)
[08:36:13] (CR) CI reject: [V: -1] alertmanager: format a bit nicer the default args [puppet] - https://gerrit.wikimedia.org/r/868634 (https://phabricator.wikimedia.org/T323714) (owner: David Caro)
[08:36:15] (PS2) David Caro: karma: add metrcsinfra alertmanager [puppet] - https://gerrit.wikimedia.org/r/868638 (https://phabricator.wikimedia.org/T323714)
[08:36:24] (CR) CI reject: [V: -1] karma: add metrcsinfra alertmanager [puppet] - https://gerrit.wikimedia.org/r/868638 (https://phabricator.wikimedia.org/T323714) (owner: David Caro)
[08:36:55] (Abandoned) David Caro: Revert "cumin: add an audit report for insetup servers" [puppet] - https://gerrit.wikimedia.org/r/866699 (owner: David Caro)
[09:22:35] (PS7) Slyngshede: Signup and LDAP flow. [software/bitu] - https://gerrit.wikimedia.org/r/860021
[09:25:53] (PS5) Jcrespo: dbbackups: Start backing up backup1-eqiad and backup1-codfw sections [puppet] - https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582)
[09:27:09] (CR) Slyngshede: Signup and LDAP flow.
(3 comments) [software/bitu] - https://gerrit.wikimedia.org/r/860021 (owner: Slyngshede)
[09:27:44] (CR) CI reject: [V: -1] dbbackups: Start backing up backup1-eqiad and backup1-codfw sections [puppet] - https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[09:29:23] (PS1) Muehlenhoff: os-updates-report: Allow passing an additional owners file [puppet] - https://gerrit.wikimedia.org/r/870749
[09:34:07] (CR) Muehlenhoff: [C: +2] os-updates-report: Allow passing an additional owners file [puppet] - https://gerrit.wikimedia.org/r/870749 (owner: Muehlenhoff)
[09:38:31] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-jumbo1013.eqiad.wmnet with OS bullseye
[09:38:39] SRE, ops-eqiad, DC-Ops, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1013...
[09:39:47] (CR) Ayounsi: "Some comments but overall this goes in the good direction!" [puppet] - https://gerrit.wikimedia.org/r/869224 (owner: Jbond)
[09:40:15] (PS1) Muehlenhoff: os-reports: Add owner overrides [puppet] - https://gerrit.wikimedia.org/r/870750
[09:40:36] (CR) CI reject: [V: -1] os-reports: Add owner overrides [puppet] - https://gerrit.wikimedia.org/r/870750 (owner: Muehlenhoff)
[09:41:23] (PS2) Muehlenhoff: os-reports: Add owner overrides [puppet] - https://gerrit.wikimedia.org/r/870750
[09:42:29] (PS3) Muehlenhoff: os-reports: Add owner overrides [puppet] - https://gerrit.wikimedia.org/r/870750
[09:44:16] (CR) CI reject: [V: -1] os-reports: Add owner overrides [puppet] - https://gerrit.wikimedia.org/r/870750 (owner: Muehlenhoff)
[09:44:28] (CR) Ayounsi: [C: +1] "LGTM but let's not deploy it before the end of year break."
[homer/public] - https://gerrit.wikimedia.org/r/869736 (owner: Majavah)
[09:46:14] (PS4) Muehlenhoff: os-reports: Add owner overrides [puppet] - https://gerrit.wikimedia.org/r/870750
[09:50:10] (CR) Muehlenhoff: [C: +2] os-reports: Add owner overrides [puppet] - https://gerrit.wikimedia.org/r/870750 (owner: Muehlenhoff)
[09:55:58] (PS1) Marostegui: mariadb: Clarify unix_socket entries [puppet] - https://gerrit.wikimedia.org/r/870751 (https://phabricator.wikimedia.org/T325154)
[09:58:06] (CR) Marostegui: [C: +2] mariadb: Clarify unix_socket entries [puppet] - https://gerrit.wikimedia.org/r/870751 (https://phabricator.wikimedia.org/T325154) (owner: Marostegui)
[10:06:48] (PS1) Muehlenhoff: Add a license statement to puppet.git with an overview [puppet] - https://gerrit.wikimedia.org/r/870813
[10:09:27] (PS6) Jcrespo: dbbackups: Start backing up backup1-eqiad and backup1-codfw sections [puppet] - https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582)
[10:10:16] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:20:23] (CR) RhinosF1: Add a license statement to puppet.git with an overview (1 comment) [puppet] - https://gerrit.wikimedia.org/r/870813 (owner: Muehlenhoff)
[10:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:22:42] (CR) Gehel: [C: -1] "A few open questions inline. Review by someone more familiar with our dumps infrastructure would be welcomed."
[puppet] - https://gerrit.wikimedia.org/r/870714 (https://phabricator.wikimedia.org/T222349) (owner: Bking)
[10:23:57] (PS2) Muehlenhoff: Add a license statement to puppet.git with an overview [puppet] - https://gerrit.wikimedia.org/r/870813
[10:24:04] (CR) Muehlenhoff: Add a license statement to puppet.git with an overview (1 comment) [puppet] - https://gerrit.wikimedia.org/r/870813 (owner: Muehlenhoff)
[10:24:53] (PS8) Slyngshede: Signup and LDAP flow. [software/bitu] - https://gerrit.wikimedia.org/r/860021
[10:24:55] (PS2) Slyngshede: WIP: Access Requests [software/bitu] - https://gerrit.wikimedia.org/r/870747
[10:24:59] (CR) RhinosF1: Add a license statement to puppet.git with an overview (1 comment) [puppet] - https://gerrit.wikimedia.org/r/870813 (owner: Muehlenhoff)
[10:34:41] SRE, All-and-every-Wikisource, Product-Analytics, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (Darwinius) @Seddon Was probably indexed in the last couple of days, most probably related to it appearing on this thread, since many pages created...
[10:35:46] SRE, Infrastructure-Foundations, netops: Juniper QFC5120 error logs on lsw1-e1 and lsw1-f1: Failed to get ifl for ifl index - https://phabricator.wikimedia.org/T325801 (cmooney) p:Triage→Low
[10:37:39] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:37:58] (PS1) Btullis: Disable the presto server on the 10 new hosts [puppet] - https://gerrit.wikimedia.org/r/870819 (https://phabricator.wikimedia.org/T323783)
[10:38:25] (CR) CI reject: [V: -1] Disable the presto server on the 10 new hosts [puppet] - https://gerrit.wikimedia.org/r/870819 (https://phabricator.wikimedia.org/T323783) (owner: Btullis)
[10:40:18] (PS2) Btullis: Disable the presto server on the 10 new hosts [puppet] - https://gerrit.wikimedia.org/r/870819 (https://phabricator.wikimedia.org/T323783)
[10:42:16] (CR) CI reject: [V: -1] Disable the presto server on the 10 new hosts [puppet] - https://gerrit.wikimedia.org/r/870819 (https://phabricator.wikimedia.org/T323783) (owner: Btullis)
[10:42:20] (CR) ArielGlenn: query_service: Allow query hosts to rsync data from clouddumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/870714 (https://phabricator.wikimedia.org/T222349) (owner: Bking)
[10:45:05] (PS3) Btullis: Disable the presto server on the 10 new hosts [puppet] - https://gerrit.wikimedia.org/r/870819 (https://phabricator.wikimedia.org/T323783)
[10:50:38] (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38921/console" [puppet] - https://gerrit.wikimedia.org/r/870819 (https://phabricator.wikimedia.org/T323783) (owner: Btullis)
[10:54:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance
cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:54:12] ops-eqiad, Infrastructure-Foundations, netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (ayounsi) p:Triage→High
[10:54:50] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1013.eqiad.wmnet with reason: host reimage
[10:55:21] SRE, Infrastructure-Foundations, netops: Juniper QFX5120 error logs on lsw1-e1 and lsw1-f1: Failed to get ifl for ifl index - https://phabricator.wikimedia.org/T325801 (ayounsi)
[10:57:01] (CR) Muehlenhoff: "Code looks good, a few remaining typos and proposed text changes" [software/bitu] - https://gerrit.wikimedia.org/r/860021 (owner: Slyngshede)
[10:57:25] (CR) Btullis: [V: +1 C: +2] Disable the presto server on the 10 new hosts [puppet] - https://gerrit.wikimedia.org/r/870819 (https://phabricator.wikimedia.org/T323783) (owner: Btullis)
[10:57:40] (PS1) JMeybohm: k8s: Add the ClusterIP of kubernetes.default.cluster.local to cert [puppet] - https://gerrit.wikimedia.org/r/870820 (https://phabricator.wikimedia.org/T307943)
[10:57:56] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1013.eqiad.wmnet with reason: host reimage
[11:01:26] SRE, Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (MoritzMuehlenhoff)
[11:05:45] (JobUnavailable) resolved: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:08:45] SRE,
Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (MoritzMuehlenhoff)
[11:09:57] !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[11:11:24] ops-eqiad, Infrastructure-Foundations, netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (cmooney) For the record I had a quick look at the codfw / ulsfo / eqsin / esams virtual-chassis port stats and none of them are showing historical CRC errors.
[11:11:45] (CR) Elukey: [C: +1] "LGTM! Let's run puppet just to make sure it works as intended (if it is only for 1.23 no diff will pop up probably, but a check is always " [puppet] - https://gerrit.wikimedia.org/r/870820 (https://phabricator.wikimedia.org/T307943) (owner: JMeybohm)
[11:12:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:13:10] SRE, serviceops: k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700 (ayounsi) I opened {T325806} to get the dashboard back online, I'll re-open it if there is any need. Thanks!
[11:14:56] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 17806
[11:16:16] (PS1) Muehlenhoff: netbox: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/870822
[11:17:12] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:17:22] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/870822 (owner: Muehlenhoff)
[11:23:21] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 17806
[11:26:34] (Abandoned) Muehlenhoff: Disable LDAP auth in debmonitor [puppet] - https://gerrit.wikimedia.org/r/661078 (owner: Muehlenhoff)
[11:26:57] (Abandoned) Muehlenhoff: Reenable U2F for now [puppet] - https://gerrit.wikimedia.org/r/805836 (owner: Muehlenhoff)
[11:28:03] (Abandoned) Muehlenhoff: Enable OIDC in Gradle build [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/810867 (https://phabricator.wikimedia.org/T311999) (owner: Muehlenhoff)
[11:28:44] (PS3) Muehlenhoff: Enable profile::auto_restarts::service for kpropd [puppet] - https://gerrit.wikimedia.org/r/775318 (https://phabricator.wikimedia.org/T135991)
[11:30:28] !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001"
[11:30:32] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1013.eqiad.wmnet with OS bullseye
[11:30:39] SRE, ops-eqiad, DC-Ops, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by
btullis@cumin1001 for host kafka-jumbo1013.eqi...
[11:32:29] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-jumbo1014.eqiad.wmnet with OS bullseye
[11:32:36] SRE, ops-eqiad, DC-Ops, Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1014...
[11:34:43] (PS1) Muehlenhoff: Fix up comment wrt use of restrict for WMCS [puppet] - https://gerrit.wikimedia.org/r/870824
[11:35:19] (Abandoned) Muehlenhoff: cumin: Switch SSH key config to "restrict" [puppet] - https://gerrit.wikimedia.org/r/675083 (owner: Muehlenhoff)
[11:36:52] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:44:58] (PS9) Slyngshede: Signup and LDAP flow. [software/bitu] - https://gerrit.wikimedia.org/r/860021
[11:45:05] (CR) Slyngshede: Signup and LDAP flow. (3 comments) [software/bitu] - https://gerrit.wikimedia.org/r/860021 (owner: Slyngshede)
[11:52:25] (CR) Muehlenhoff: "One final typo :-)" [software/bitu] - https://gerrit.wikimedia.org/r/860021 (owner: Slyngshede)
[11:53:59] (PS10) Slyngshede: Signup and LDAP flow. [software/bitu] - https://gerrit.wikimedia.org/r/860021
[11:54:09] (CR) Slyngshede: Signup and LDAP flow.
(1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/860021 (owner: Slyngshede)
[11:56:28] (CR) Muehlenhoff: [C: +1] "Ship it :-)" [software/bitu] - https://gerrit.wikimedia.org/r/860021 (owner: Slyngshede)
[11:58:14] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[12:01:32] (PS1) Muehlenhoff: Java: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870846
[12:01:51] (CR) CI reject: [V: -1] Java: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870846 (owner: Muehlenhoff)
[12:04:21] (CR) Arturo Borrero Gonzalez: kubeadm: psp: base-pod-security-policies.yaml: allow hostPath volumes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/870686 (https://phabricator.wikimedia.org/T325755) (owner: Arturo Borrero Gonzalez)
[12:05:55] (PS1) Muehlenhoff: cassandra: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870847
[12:07:00] (PS2) Muehlenhoff: Java: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870846
[12:07:25] (CR) CI reject: [V: -1] Java: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870846 (owner: Muehlenhoff)
[12:07:31] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/870847 (owner: Muehlenhoff)
[12:11:47] (PS3) Muehlenhoff: Java: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870846
[12:12:15] (CR) CI reject: [V: -1] Java: Remove support for stretch [puppet] - https://gerrit.wikimedia.org/r/870846 (owner: Muehlenhoff)
[12:18:37] (CR) Jaime Nuche: [C: +1] admin: create new group deployment-jenkins (1 comment) [puppet] - https://gerrit.wikimedia.org/r/869276
(https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [12:18:59] (03CR) 10Jaime Nuche: [C: 03+1] deployment_server: add keyholder/group config for jenkins-ci deploy [puppet] - 10https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [12:28:17] (03PS1) 10Btullis: Add a partman recipe for the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/870861 (https://phabricator.wikimedia.org/T324670) [12:28:42] (03CR) 10CI reject: [V: 04-1] Add a partman recipe for the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/870861 (https://phabricator.wikimedia.org/T324670) (owner: 10Btullis) [12:30:21] (03PS2) 10Btullis: Add a partman recipe for the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/870861 (https://phabricator.wikimedia.org/T324670) [12:34:16] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1014.eqiad.wmnet with reason: host reimage [12:37:18] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1014.eqiad.wmnet with reason: host reimage [12:49:45] !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [12:50:15] (03CR) 10Slyngshede: [V: 03+2] Signup and LDAP flow. [software/bitu] - 10https://gerrit.wikimedia.org/r/860021 (owner: 10Slyngshede) [12:50:18] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Signup and LDAP flow. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/860021 (owner: 10Slyngshede) [12:50:59] (03PS3) 10Slyngshede: WIP: Access Requests [software/bitu] - 10https://gerrit.wikimedia.org/r/870747 [13:08:53] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (10ayounsi) I had a closer look as there is support for this kind of graphing and alerting in LibreNMS since a while https://github.com/librenms/librenms/blame/258505ed4429050344... [13:14:34] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:54] 10SRE, 10ops-ulsfo: ripe-atlas-ulsfo down - https://phabricator.wikimedia.org/T325549 (10ayounsi) Thanks. Regardless of VM or physical you can go ahead with decommissioning it. [13:18:35] !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [13:18:40] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1014.eqiad.wmnet with OS bullseye [13:18:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1014.eqi... 
[13:46:50] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-jumbo1015.eqiad.wmnet with OS bullseye [13:46:59] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1015... [13:49:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:49:14] (03PS2) 10Raymond Ndibe: tools-webservice: read DEFAULT_BUILD_SERVICE_REGISTRY from config [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) [13:49:50] (03CR) 10Raymond Ndibe: tools-webservice: read DEFAULT_BUILD_SERVICE_REGISTRY from config (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe) [14:10:37] (03CR) 10David Caro: [V: 03+1 C: 03+2] Add moved to wmcs-cookbooks message. [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/869167 (https://phabricator.wikimedia.org/T319401) (owner: 10David Caro) [14:10:59] (03CR) 10CI reject: [V: 04-1] Add moved to wmcs-cookbooks message. [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/869167 (https://phabricator.wikimedia.org/T319401) (owner: 10David Caro) [14:11:00] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:13] (03CR) 10FNegri: [C: 03+1] Add moved to wmcs-cookbooks message. 
(031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/869167 (https://phabricator.wikimedia.org/T319401) (owner: 10David Caro) [14:13:16] (03PS4) 10David Caro: Add moved to wmcs-cookbooks message. [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/869167 [14:16:03] (03CR) 10Jbond: "not tested but lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) [14:16:14] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) [14:23:16] (03CR) 10David Caro: [C: 03+2] Add moved to wmcs-cookbooks message. [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/869167 (owner: 10David Caro) [14:23:18] (03CR) 10Elukey: [C: 03+2] sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) [14:23:44] (03Merged) 10jenkins-bot: Add moved to wmcs-cookbooks message. 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/869167 (owner: 10David Caro) [14:27:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:37:39] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:42:59] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-jumbo1015.eqiad.wmnet with OS bullseye [14:43:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1015.eqi... [14:43:55] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-jumbo1015.eqiad.wmnet with OS bullseye [14:44:03] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1015... 
[14:47:23] !log truncate daemon.log.1 on maps1009 to free up disk space [14:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:18] RECOVERY - Disk space on maps1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=maps1009&var-datasource=eqiad+prometheus/ops [14:59:48] RECOVERY - puppet last run on maps1009 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [15:12:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:17:04] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/870824 (owner: 10Muehlenhoff) [15:30:02] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/870813 (owner: 10Muehlenhoff) [15:37:28] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (10Jclark-ctr) @ayounsi I do have spare optics for connection. 
1/5/23 is a good day to perform this maintenance [15:38:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:40:14] !log btullis@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-jumbo1015.eqiad.wmnet with OS bullseye [15:40:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1015.eqi... [15:42:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:42:23] (03PS1) 10AikoChou: ml-services: update revertrisk docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/870900 (https://phabricator.wikimedia.org/T325349) [15:43:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:45:38] (03PS1) 10JHathaway: rspamd: vendor github.com/oxc/puppet-rspamd [puppet] - 10https://gerrit.wikimedia.org/r/870901 (https://phabricator.wikimedia.org/T325397) [15:47:25] (03PS2) 10JHathaway: rspamd: vendor github.com/oxc/puppet-rspamd [puppet] - 
10https://gerrit.wikimedia.org/r/870901 (https://phabricator.wikimedia.org/T325397) [15:48:21] (03CR) 10JHathaway: "kindly review!" [puppet] - 10https://gerrit.wikimedia.org/r/870901 (https://phabricator.wikimedia.org/T325397) (owner: 10JHathaway) [15:53:54] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: update revertrisk docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/870900 (https://phabricator.wikimedia.org/T325349) (owner: 10AikoChou) [15:54:35] (03CR) 10Elukey: [C: 03+2] ml-services: update revertrisk docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/870900 (https://phabricator.wikimedia.org/T325349) (owner: 10AikoChou) [15:56:40] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:57:00] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:57:28] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:57:50] (03PS1) 10Giuseppe Lavagetto: Add the cache.mcrouter module [deployment-charts] - 10https://gerrit.wikimedia.org/r/870903 [15:58:13] (03PS1) 10Ayounsi: BGP for NTT in drmrs [homer/public] - 10https://gerrit.wikimedia.org/r/870904 (https://phabricator.wikimedia.org/T314929) [15:58:14] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:58:52] (03PS1) 10Alexandros Kosiaris: imposm: Ship rsyslog/lograte rules and make it quiet 
[puppet] - 10https://gerrit.wikimedia.org/r/870905 [15:59:16] (03CR) 10CI reject: [V: 04-1] imposm: Ship rsyslog/lograte rules and make it quiet [puppet] - 10https://gerrit.wikimedia.org/r/870905 (owner: 10Alexandros Kosiaris) [16:00:22] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [16:01:05] (03PS2) 10Alexandros Kosiaris: imposm: Ship rsyslog/lograte rules and make it quiet [puppet] - 10https://gerrit.wikimedia.org/r/870905 [16:02:04] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (10cmooney) >>! In T325803#8486630, @ayounsi wrote: > For example `jnxVirtualChassisPortInCRCAlignErrors.5."vcp-255/1/3" = 42` while it has been cleared and should now be at 0.... [16:02:31] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Specify Citoid RESTBase URL separately [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869226 (https://phabricator.wikimedia.org/T325425) (owner: 10Bartosz Dziewoński) [16:03:05] (03CR) 10Giuseppe Lavagetto: [C: 03+1] imposm: Ship rsyslog/lograte rules and make it quiet [puppet] - 10https://gerrit.wikimedia.org/r/870905 (owner: 10Alexandros Kosiaris) [16:03:54] (03PS18) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) [16:03:56] (03PS1) 10Elukey: sre.discovery.service-route: fix bugs [cookbooks] - 10https://gerrit.wikimedia.org/r/870926 (https://phabricator.wikimedia.org/T277677) [16:04:20] (03CR) 10Cathal Mooney: [C: 03+1] "Looks good, reasons why as outlined on the task make sense to me." 
[homer/public] - 10https://gerrit.wikimedia.org/r/867128 (https://phabricator.wikimedia.org/T324955) (owner: 10Ayounsi) [16:05:05] (03PS3) 10Alexandros Kosiaris: imposm: Ship rsyslog/lograte rules and make it quiet [puppet] - 10https://gerrit.wikimedia.org/r/870905 [16:07:17] (03PS4) 10Alexandros Kosiaris: imposm: Ship rsyslog/lograte rules and make it quiet [puppet] - 10https://gerrit.wikimedia.org/r/870905 [16:07:44] (03CR) 10Elukey: [C: 03+2] sre.discovery.service-route: fix bugs [cookbooks] - 10https://gerrit.wikimedia.org/r/870926 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) [16:09:44] !log elukey@cumin1001 START - Cookbook sre.discovery.service-route check inference: maintenance [16:09:44] !log elukey@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) check inference: maintenance [16:09:55] (03PS5) 10Alexandros Kosiaris: imposm: Ship rsyslog/lograte rules and make it quiet [puppet] - 10https://gerrit.wikimedia.org/r/870905 [16:10:18] !log elukey@cumin1001 START - Cookbook sre.discovery.service-route depool inference in codfw: maintenance [16:11:21] testing --^ [16:11:22] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38926/console" [puppet] - 10https://gerrit.wikimedia.org/r/870905 (owner: 10Alexandros Kosiaris) [16:14:28] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: [eqiad] faulty VC optics - https://phabricator.wikimedia.org/T325803 (10ayounsi) > jnxVirtualChassisPortInCRCAlignErrors is a COUNTER64, so I'm not sure that clearing the device counters should reset what SNMP reports. If it did then LibreNMS woul... 
[16:15:21] !log elukey@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool inference in codfw: maintenance [16:16:11] (03PS6) 10Alexandros Kosiaris: imposm: Ship rsyslog/lograte rules and make it quiet [puppet] - 10https://gerrit.wikimedia.org/r/870905 [16:17:40] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38927/console" [puppet] - 10https://gerrit.wikimedia.org/r/870905 (owner: 10Alexandros Kosiaris) [16:17:40] !log elukey@cumin1001 START - Cookbook sre.discovery.service-route depool inference in eqiad: maintenance [16:17:41] !log elukey@cumin1001 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) depool inference in eqiad: maintenance [16:17:53] perfect [16:18:05] !log elukey@cumin1001 START - Cookbook sre.discovery.service-route pool inference in codfw: maintenance [16:19:37] (03PS7) 10Alexandros Kosiaris: imposm: Ship rsyslog/lograte rules and make it quiet [puppet] - 10https://gerrit.wikimedia.org/r/870905 [16:19:41] (03CR) 10Alexandros Kosiaris: [C: 03+2] imposm: Ship rsyslog/lograte rules and make it quiet [puppet] - 10https://gerrit.wikimedia.org/r/870905 (owner: 10Alexandros Kosiaris) [16:23:08] !log elukey@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) pool inference in codfw: maintenance [16:25:41] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Prod-Kubernetes, and 3 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (10elukey) Current status: * `sre.discovery.service-route` (used by `sre.k8s.pool-depool-cluster`) has been moved to the... 
[16:29:42] (03CR) 10Jaime Nuche: [C: 03+1] scap: disable git safe.directory [puppet] - 10https://gerrit.wikimedia.org/r/868002 (https://phabricator.wikimedia.org/T325128) (owner: 10Hashar) [16:40:59] 10SRE, 10LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (10BCornwall) There seems to be a little confusion on what's needed here. Talking with @Nikerabbit on IRC confirmed that WMF-NDA ticket access is not necessary, only logstash access. As per [[ htt... [16:42:14] 10SRE, 10LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (10BCornwall) p:05Triage→03High Marking as high priority since this has taken much longer than necessary: Let's get this done. [16:44:59] (03CR) 10Btullis: [C: 03+2] Add a partman recipe for the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/870861 (https://phabricator.wikimedia.org/T324670) (owner: 10Btullis) [16:50:15] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [16:51:09] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [16:51:36] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-jumbo1015.eqiad.wmnet with OS bullseye [16:51:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1001 for host kafka-jumbo1015... 
[16:53:26] (03CR) 10Alexandros Kosiaris: query_service: Allow query hosts to rsync data from clouddumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/870714 (https://phabricator.wikimedia.org/T222349) (owner: 10Bking) [16:56:19] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host cephosd1001.eqiad.wmnet with OS bullseye [16:57:54] (03CR) 10Eevans: [C: 03+1] "Puppet compiler running against restbase-dev1006 is interesting, where did it pick that up from?" [puppet] - 10https://gerrit.wikimedia.org/r/870847 (owner: 10Muehlenhoff) [16:59:41] 10SRE, 10Infrastructure-Foundations, 10netops: Juniper QFX5120 error logs on lsw1-e1 and lsw1-f1: Failed to get ifl for ifl index - https://phabricator.wikimedia.org/T325801 (10cmooney) Juniper have come back to say the message is harmless and can be ignored. > It’s a Harmless error message. > > I gone thr... [17:20:08] 10ops-ulsfo, 10decommission-hardware: decommission atlas-ulsfo - https://phabricator.wikimedia.org/T325824 (10RobH) p:05Triage→03Low [17:21:33] 10ops-ulsfo, 10decommission-hardware: decommission atlas-ulsfo - https://phabricator.wikimedia.org/T325824 (10RobH) [17:22:16] !log robh@cumin2002 START - Cookbook sre.dns.netbox [17:22:27] removing broken atlas from things [17:24:58] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: atlas ulsfo decom - robh@cumin2002" [17:25:47] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: atlas ulsfo decom - robh@cumin2002" [17:25:47] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:26:26] (03PS1) 10Btullis: Correct the filename of the partman recipe for cephosd [puppet] - 10https://gerrit.wikimedia.org/r/870955 (https://phabricator.wikimedia.org/T324670) [17:29:43] (03CR) 10Btullis: [C: 03+2] Correct the filename of the partman recipe for cephosd 
[puppet] - 10https://gerrit.wikimedia.org/r/870955 (https://phabricator.wikimedia.org/T324670) (owner: 10Btullis) [17:35:28] 10SRE, 10ops-ulsfo: ripe-atlas-ulsfo down - https://phabricator.wikimedia.org/T325549 (10RobH) 05Open→03Resolved [17:36:49] 10SRE, 10ops-ulsfo, 10decommission-hardware: decommission atlas-ulsfo - https://phabricator.wikimedia.org/T325824 (10RobH) a:05RobH→03ayounsi Arzhel, Does anything need to be done in the RIPE portal now that this atlas is defunct? I've removed its dns entries and disabled its switch port. Once the rip... [17:37:19] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10EddieGP) [18:00:36] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-jumbo1015.eqiad.wmnet with reason: host reimage [18:03:35] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-jumbo1015.eqiad.wmnet with reason: host reimage [18:04:13] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - ASunknown/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:08:23] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Wangombe - https://phabricator.wikimedia.org/T325828 (10BCornwall) [18:08:43] 10SRE, 10LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (10BCornwall) I've been corrected: [[ https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#WMF_Group | Documentation ]] shows that contractors should also be added to the WMF-NDA gr... 
[18:09:25] 10SRE, 10LDAP-Access-Requests: WMF-NDA access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (10BCornwall) 05Open→03Resolved [18:09:45] 10SRE, 10LDAP-Access-Requests: WMF-NDA access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (10BCornwall) [18:09:47] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for Wangombe - https://phabricator.wikimedia.org/T325828 (10BCornwall) p:05Triage→03High Moving to high priority since this has taken a long time. [18:12:28] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Start backing up backup1-eqiad and backup1-codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [18:16:04] !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [18:26:05] (03PS1) 10Jcrespo: mediabackups: Migrate mediabackups database service to backup1 sections [puppet] - 10https://gerrit.wikimedia.org/r/870964 (https://phabricator.wikimedia.org/T313582) [18:26:56] (03CR) 10Jcrespo: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/870964 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [18:27:11] !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [18:27:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-jumbo1015.eqiad.wmnet with OS bullseye [18:27:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure (EQ2 Kanban (Sprints 04-05)): Q4:(Need By: TBD) rack/setup/install kafka-jumbo101[0-5] - https://phabricator.wikimedia.org/T306939 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1001 for host kafka-jumbo1015.eqi... 
[18:27:35] (03CR) 10RLazarus: [C: 03+2] Add an option, off by default, to retry once when a request times out. [software/httpbb] - 10https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707) (owner: 10RLazarus) [18:28:13] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Migrate mediabackups database service to backup1 sections [puppet] - 10https://gerrit.wikimedia.org/r/870964 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [18:29:13] (03Merged) 10jenkins-bot: Add an option, off by default, to retry once when a request times out. [software/httpbb] - 10https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707) (owner: 10RLazarus) [18:32:03] (03PS2) 10Jcrespo: mariadb: Reenable notifications on backup1 mariadb instances [puppet] - 10https://gerrit.wikimedia.org/r/868688 (https://phabricator.wikimedia.org/T313582) [18:32:53] (03CR) 10Jcrespo: [C: 03+2] mariadb: Reenable notifications on backup1 mariadb instances [puppet] - 10https://gerrit.wikimedia.org/r/868688 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [18:37:40] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:39:22] (03PS1) 10Jcrespo: mediabackups: Disable notifications of dbs db1176 & db2151 [puppet] - 10https://gerrit.wikimedia.org/r/870967 (https://phabricator.wikimedia.org/T313582) [18:40:07] (03PS2) 10Jcrespo: mediabackups: Disable notifications of dbs db1176 & db2151 [puppet] - 10https://gerrit.wikimedia.org/r/870967 (https://phabricator.wikimedia.org/T313582) [18:44:09] (03CR) 10Jcrespo: [C: 03+2] "Another FYI. 
I am close to return the borrowed host back to you, but I want to keep both in parallel for final checks and making sure they" [puppet] - 10https://gerrit.wikimedia.org/r/870967 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [18:52:43] (03PS2) 10Matthias Mullie: [SearchVue] Enable on ruwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849025 (https://phabricator.wikimedia.org/T311667) [18:55:08] (03PS3) 10Matthias Mullie: [SearchVue] Enable on ruwiki (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/849025 (https://phabricator.wikimedia.org/T311667) [19:02:42] 10SRE, 10ops-ulsfo, 10decommission-hardware: decommission atlas-ulsfo - https://phabricator.wikimedia.org/T325824 (10ayounsi) As the box is dead anyway, no need to block on any RIPE portal action. You can proceed with it next time you're onsite. [19:12:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:14:42] (03PS1) 10Jcrespo: mariadb: Decommission db1176 and db2151 to spares; remove mediabackupstemp [puppet] - 10https://gerrit.wikimedia.org/r/870970 (https://phabricator.wikimedia.org/T313582) [19:15:02] (03CR) 10CI reject: [V: 04-1] mariadb: Decommission db1176 and db2151 to spares; remove mediabackupstemp [puppet] - 10https://gerrit.wikimedia.org/r/870970 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [19:16:33] (03PS2) 10Jcrespo: mariadb: Decommission db1176 & db2151 to spare; remove mediabackupstemp [puppet] - 10https://gerrit.wikimedia.org/r/870970 (https://phabricator.wikimedia.org/T313582) [19:19:34] (03CR) 10Jcrespo: [C: 04-2] "Not ready yet (requires verification of new hosts) but please start reviewing as this is not a trivial decommission. Not urgent, though." 
[puppet] - 10https://gerrit.wikimedia.org/r/870970 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [19:21:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:26:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:58:14] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [20:09:50] (03PS1) 10Xcollazo: Add a systemd timer to clean up old data related to image_suggestions [puppet] - 10https://gerrit.wikimedia.org/r/870974 (https://phabricator.wikimedia.org/T323614) [20:11:27] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:14:09] (03PS1) 10RobH: adding pdus for racks f[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/870975 (https://phabricator.wikimedia.org/T290899) [20:14:45] (03CR) 10RobH: [C: 03+2] adding pdus for racks f[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/870975 (https://phabricator.wikimedia.org/T290899) (owner: 10RobH) [20:47:22] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid 
[20:48:58] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:49:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:04:22] (03PS1) 10JHathaway: vrts: don't wrap hiera lookup in Sensitive type [puppet] - 10https://gerrit.wikimedia.org/r/870977 [21:04:37] (03PS1) 10Stang: Revert "trwiki: Add 20 years celebration logos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/870920 [21:04:58] (03PS2) 10Stang: Revert "trwiki: Add 20 years celebration logos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/870920 (https://phabricator.wikimedia.org/T325823) [21:05:11] (03PS3) 10Stang: Revert "trwiki: Add 20 years celebration logos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/870920 (https://phabricator.wikimedia.org/T325823) [21:10:30] (03PS1) 10Stang: plwiki: Add editcontentmodel to interface-admin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/870978 (https://phabricator.wikimedia.org/T325819) [21:15:01] (03Abandoned) 10JHathaway: vrts: don't wrap hiera lookup in Sensitive type [puppet] - 10https://gerrit.wikimedia.org/r/870977 (owner: 10JHathaway) [21:43:05] (03PS1) 10Stang: kuwiki: Install SandboxLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/870988 (https://phabricator.wikimedia.org/T325469) [21:44:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency 
[22:23:16] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:37:40] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:54:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:59:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:12:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:16:53] (PS1) BCornwall: check_user.py: Fix GSuite misspelling [puppet] - https://gerrit.wikimedia.org/r/870994
[23:29:09] SRE, LDAP-Access-Requests: Grant Access to wmf for Wangombe - https://phabricator.wikimedia.org/T325828 (BCornwall) Hi, @Wangombe, what is your first/last name? I'll need that for the CR that I'll need to create. Thanks!
[23:41:52] SRE, SRE Observability (FY2022/2023-Q2): WMF SRE holiday paging tracker 2022 - https://phabricator.wikimedia.org/T325856 (lmata)
[23:43:54] SRE, SRE Observability (FY2022/2023-Q2): WMF SRE holiday paging tracker 2022 - https://phabricator.wikimedia.org/T325856 (lmata) Business on-call {F35889023} {F35889025}
[23:45:12] (CR) RLazarus: [C: +1] check_user.py: Fix GSuite misspelling [puppet] - https://gerrit.wikimedia.org/r/870994 (owner: BCornwall)
[23:45:46] SRE, SRE Observability (FY2022/2023-Q2): WMF SRE holiday paging tracker 2022 - https://phabricator.wikimedia.org/T325856 (lmata) {F35889027}
[23:46:48] SRE, SRE Observability (FY2022/2023-Q2): WMF SRE holiday paging tracker 2022 - https://phabricator.wikimedia.org/T325856 (lmata) {F35889030}
[23:49:33] SRE, SRE Observability (FY2022/2023-Q2): WMF SRE holiday paging tracker 2022 - https://phabricator.wikimedia.org/T325856 (lmata) This screen shows current settings. Removed rotations and step 1 and escalates immediately to batphone. {F35889032}
[23:50:32] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:50:50] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:55:49] SRE, SRE Observability (FY2022/2023-Q2): WMF SRE holiday paging tracker 2022 - https://phabricator.wikimedia.org/T325856 (lmata) Open→Stalled Stalling until Jan 3rd
[23:58:14] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert