[00:05:31] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:05:55] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:07:25] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49123 bytes in 4.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:08:35] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.242 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:09:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:14:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:19:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:23:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:28:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:31:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:37:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:47] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:19] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:46:55] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:51:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:56:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:59:45]
(JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:04:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:20:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:25:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:26:42] SRE, DiscussionTools, Wikimedia-Incident: large number of 503 errors - https://phabricator.wikimedia.org/T325477 (tstarling) Since 2022-12-08 22:20 we have had high traffic to action=discussiontoolspageinfo , with a daily peak of around 2k req/s. So that is {T321961}. The outage was more a latency...
[01:28:00] (PS1) Tim Starling: Revert "Start mobile DiscussionTools A/B test" [mediawiki-config] - https://gerrit.wikimedia.org/r/868544 (https://phabricator.wikimedia.org/T321961)
[01:30:18] (PS2) Tim Starling: Revert "Start mobile DiscussionTools A/B test" [mediawiki-config] - https://gerrit.wikimedia.org/r/868544 (https://phabricator.wikimedia.org/T321961)
[01:31:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:36:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:38:45] (JobUnavailable) firing: (4) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:43:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:45:47] PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 5.427e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=11
[01:46:14] (CR) Andrea Denisse: [C: +1] "LGTM!"
[mediawiki-config] - https://gerrit.wikimedia.org/r/868544 (https://phabricator.wikimedia.org/T321961) (owner: Tim Starling)
[01:47:16] (CR) Platonides: [C: +1] Revert "Start mobile DiscussionTools A/B test" [mediawiki-config] - https://gerrit.wikimedia.org/r/868544 (https://phabricator.wikimedia.org/T321961) (owner: Tim Starling)
[01:48:45] (JobUnavailable) firing: (9) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:53:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:58:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:03:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:04:27] PROBLEM - Check systemd state on an-master1002 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-namenode-backup-hdfs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:05:14] (CR) Ladsgroup: [C: +1] "I'm around if you want to deploy."
[mediawiki-config] - https://gerrit.wikimedia.org/r/868544 (https://phabricator.wikimedia.org/T321961) (owner: Tim Starling)
[02:08:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:18:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[02:24:15] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:29:15] (JobUnavailable) resolved: (6) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:29:58] (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificates) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:34:59]
(KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificates) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:39:10] SRE, DiscussionTools, Patch-For-Review, Wikimedia-Incident: large number of 503 errors - https://phabricator.wikimedia.org/T325477 (tstarling) discussiontoolspageinfo request rate {F35874913} A bit noisy starting around 14:50, but other days show that kind of pattern near the daily peak. It get...
[02:40:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:45:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:02] (CR) Tim Starling: [C: +2] Revert "Start mobile DiscussionTools A/B test" [mediawiki-config] - https://gerrit.wikimedia.org/r/868544 (https://phabricator.wikimedia.org/T321961) (owner: Tim Starling)
[02:51:43] (Merged) jenkins-bot: Revert "Start mobile DiscussionTools A/B test" [mediawiki-config] - https://gerrit.wikimedia.org/r/868544 (https://phabricator.wikimedia.org/T321961) (owner: Tim Starling)
[03:02:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:07:45] (JobUnavailable) resolved: Reduced availability for job
webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:51] !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: disable wgDiscussionToolsABTest T325477 T321961 (duration: 15m 23s)
[03:08:57] T325477: large number of 503 errors - https://phabricator.wikimedia.org/T325477
[03:08:57] T321961: [Config Change] Start mobile DiscussionTools A/B test - https://phabricator.wikimedia.org/T321961
[03:10:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:15:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:19:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:24:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:27:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:28:33] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 162 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:29:41] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[03:30:07] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:31:17] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[03:37:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:43:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:58:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:59:16] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:00:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:01:45] (JobUnavailable) firing:
Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:13:19] SRE, DiscussionTools, Wikimedia-Incident: large number of 503 errors - https://phabricator.wikimedia.org/T325477 (DLynch) @tstarling can those charts get more granular? I'd be very interested to know whether it was the `transcludedfrom` or `threaditemshtml` prop being requested from `discussiontoolsp...
[04:14:16] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:16:07] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[04:16:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:17:41] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[04:22:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:31:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:36:02] SRE, DiscussionTools, Wikimedia-Incident: large number of 503 errors - https://phabricator.wikimedia.org/T325477 (tstarling) Since the configuration variable is saved into the Varnish/ATS cache, you can still see it on some pages. For example viewing https://es.m.wikipedia.org/wiki/Sede_de_la_Organiz...
[04:41:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:42:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:43:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:44:28] SRE, DiscussionTools, Wikimedia-Incident: large number of 503 errors - https://phabricator.wikimedia.org/T325477 (tstarling) The stack trace for the network request shows controller.js init() calling getPageData(): `lang=js // TODO: Isn't this too early to load it? We will only need it if the user...
[04:49:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:54:02] SRE, DiscussionTools, Wikimedia-Incident: large number of 503 errors - https://phabricator.wikimedia.org/T325477 (DLynch) Yeah, the issue here is (mostly) us including the general DiscussionTools JS for some test-related effects, and it having those early-loading side-effects. We should either pull o...
[04:58:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:59:18] SRE, SRE Program Management, Logos: SRE needs a logo - https://phabricator.wikimedia.org/T312067 (Ladsgroup) Doodling some ideas: {F35875012} {F35875011} {F35875010} {F35875009} {F35875008} {F35875007}
[05:00:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:04:41] SRE, DiscussionTools, Wikimedia-Incident: large number of 503 errors - https://phabricator.wikimedia.org/T325477 (tstarling) Does the response have any private data in it? I think if ApiDiscussionToolsPageInfo::execute() called $this->getMain()->setCacheMode( 'public' ) and you set the query string p...
[05:09:01] (CR) Ladsgroup: [C: +1] "If we are sure all contributors have agreed. I think there is one that's banned?" [puppet] - https://gerrit.wikimedia.org/r/868707 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[05:09:18] SRE, DiscussionTools, Wikimedia-Incident: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 (tstarling)
[05:09:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:10:34] SRE, DiscussionTools, Wikimedia-Incident: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 (DLynch) It shouldn't. That particular call is just asking whether anything on the page is inside a transclusion, to work out whether it can actually be u...
[05:22:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:40:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:40:47] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:45:04] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:46:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:47:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:51:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:56:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:01:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:06:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:12:47] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:16:47] PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[06:21:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:25:16] (CR) Giuseppe Lavagetto: [C: +1] Refactor: Migrate from attrs to dataclasses [software/httpbb] - https://gerrit.wikimedia.org/r/865789 (owner: RLazarus)
[06:26:45] (JobUnavailable) resolved: Reduced
availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:31:05] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:31:49] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:38:41] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:39:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:44:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:45:31] SRE, DiscussionTools, Wikimedia-Incident: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 (Ladsgroup) I might be missing something obvious but https://es.m.wikipedia.org/wiki/Sede_de_la_Organizaci%C3%B3n_de_las_Naciones_Unidas is an article. Wh...
[06:46:27] SRE, DiscussionTools, Wikimedia-Incident: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 (DLynch) See: > the issue here is (mostly) us including the general DiscussionTools JS for some test-related effects, and it having those early-loading si...
[06:48:01] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 11686
[06:48:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:51:29] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:55:05] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:55:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:55:49] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:58:11] (PS1) Marostegui:
db1207-db1229: Set up new hosts [puppet] - 10https://gerrit.wikimedia.org/r/869049 (https://phabricator.wikimedia.org/T325209) [06:59:07] (03CR) 10CI reject: [V: 04-1] db1207-db1229: Set up new hosts [puppet] - 10https://gerrit.wikimedia.org/r/869049 (https://phabricator.wikimedia.org/T325209) (owner: 10Marostegui) [07:00:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:02:05] (03PS2) 10Marostegui: db1207-db1229: Set up new hosts [puppet] - 10https://gerrit.wikimedia.org/r/869049 (https://phabricator.wikimedia.org/T325209) [07:02:41] (03CR) 10Marostegui: [C: 03+2] db1207-db1229: Set up new hosts [puppet] - 10https://gerrit.wikimedia.org/r/869049 (https://phabricator.wikimedia.org/T325209) (owner: 10Marostegui) [07:03:55] (03PS1) 10Marostegui: db2185-db2187: Add header [puppet] - 10https://gerrit.wikimedia.org/r/869050 [07:04:01] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 11686 [07:04:17] (03CR) 10Marostegui: [C: 03+2] db2185-db2187: Add header [puppet] - 10https://gerrit.wikimedia.org/r/869050 (owner: 10Marostegui) [07:04:41] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 136 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:05:15] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 28398 [07:05:29] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 28398 [07:06:17] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics 
within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:07:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:12:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:15:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:18:46] 10SRE, 10DC-Ops, 10Traffic-Icebox: Fix recdns config on various hardware devices - https://phabricator.wikimedia.org/T254178 (10ayounsi) [07:22:33] !log phedenskog@deploy1002 Started deploy [performance/navtiming@5770d46]: (no justification provided) [07:22:42] !log phedenskog@deploy1002 Finished deploy [performance/navtiming@5770d46]: (no justification provided) (duration: 00m 08s) [07:25:39] 10SRE, 10DC-Ops, 10Traffic-Icebox: Fix recdns config on various hardware devices - https://phabricator.wikimedia.org/T254178 (10ayounsi) @BCornwall I took care of both of them :) 185.15.56.1 is the generic NAT IP for the WMCS realm. @aborrero, for context, 208.80.153.254 is our old recursive DNS IP and it'... 
[07:25:45] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:53:15] (03CR) 10Muehlenhoff: lists: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868707 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [07:54:09] (03CR) 10Muehlenhoff: [C: 03+2] orchestrator: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868708 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [07:54:14] (03PS2) 10Muehlenhoff: orchestrator: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868708 (https://phabricator.wikimedia.org/T308013) [07:54:37] (03CR) 10Ladsgroup: [C: 03+1] lists: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868707 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221219T0800) [08:00:21] (03CR) 10Muehlenhoff: lists: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868707 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:02:09] !log installing openexr security updates [08:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:40] (03PS1) 10Ladsgroup: Emergency: discussiontoolspageinfo return empty response in non-talk ns [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868867 (https://phabricator.wikimedia.org/T325477) [08:08:06] (03CR) 10Ladsgroup: [C: 03+2] Emergency: discussiontoolspageinfo return empty response in non-talk ns [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868867 (https://phabricator.wikimedia.org/T325477) (owner: 10Ladsgroup) [08:11:26] (03PS1) 10Marostegui: site.pp: Remove db1206 testing [puppet] - 10https://gerrit.wikimedia.org/r/869164 [08:11:39] (03CR) 10Ladsgroup: Emergency: discussiontoolspageinfo return empty response in non-talk ns [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868867 (https://phabricator.wikimedia.org/T325477) (owner: 10Ladsgroup) [08:11:55] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove db1206 testing [puppet] - 10https://gerrit.wikimedia.org/r/869164 (owner: 10Marostegui) [08:19:55] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:24:23] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 119 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:25:59] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on 
alert1001 is OK: (C)100 gt (W)50 gt 3 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:26:13] (03CR) 10Volans: ensure_canary: 0-pad the instance counter (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/868803 (owner: 10Andrew Bogott) [08:27:29] (03CR) 10Volans: [C: 03+1] "LGTM" [software/httpbb] - 10https://gerrit.wikimedia.org/r/865789 (owner: 10RLazarus) [08:29:07] (03CR) 10Volans: [C: 03+1] "LGTM" [software/httpbb] - 10https://gerrit.wikimedia.org/r/865790 (https://phabricator.wikimedia.org/T323707) (owner: 10RLazarus) [08:29:43] (03CR) 10David Caro: "This is really unfortunate, we have a bunch of VMs whose numbering is not padded (and aim to not be so, as we might pass the 99 barrier)" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/868803 (owner: 10Andrew Bogott) [08:32:51] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 29535 [08:33:03] (03CR) 10David Caro: ensure_canary: 0-pad the instance counter (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/868803 (owner: 10Andrew Bogott) [08:33:20] (03CR) 10Volans: "question inline" [software/httpbb] - 10https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707) (owner: 10RLazarus) [08:33:22] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 29535 [08:33:56] (03PS1) 10Aqu: Test to debug missing scripts on standby namenode [puppet] - 10https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) [08:34:16] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[08:35:59] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38860/console" [puppet] - 10https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [08:37:58] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [08:39:35] (03CR) 10Ayounsi: First stab at possible ferm::qos resource for DSCP marking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [08:40:59] (03PS2) 10Aqu: Test to debug missing scripts on standby namenode [puppet] - 10https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) [08:40:59] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:41:48] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [08:42:25] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38861/console" [puppet] - 10https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [08:45:41] (03PS3) 10Aqu: Test to debug missing scripts on standby namenode [puppet] - 10https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) [08:45:59] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:46:00] (03CR) 10CI reject: [V: 04-1] Test to debug missing scripts on standby namenode [puppet] - 10https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [08:46:48] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [08:46:58] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38862/console" [puppet] - 10https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [08:50:49] (03CR) 10Ladsgroup: [C: 03+2] Emergency: discussiontoolspageinfo return empty response in non-talk ns [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868867 (https://phabricator.wikimedia.org/T325477) (owner: 10Ladsgroup) [08:53:21] (03PS1) 10David Caro: Add moved to wmcs-cookbooks message. [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/869167 [08:53:45] (03CR) 10CI reject: [V: 04-1] Add moved to wmcs-cookbooks message. [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/869167 (owner: 10David Caro) [08:54:12] (03PS4) 10Aqu: Test to debug missing scripts on standby namenode [puppet] - 10https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) [08:54:14] (03CR) 10David Caro: [C: 04-1] ensure_canary: use the smaller cloudvirt-canary-ceph flavor (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/868799 (owner: 10Andrew Bogott) [08:55:22] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38863/console" [puppet] - 10https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [08:56:01] (03PS2) 10David Caro: Add moved to wmcs-cookbooks message. [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/869167 [08:56:21] (03CR) 10CI reject: [V: 04-1] Add moved to wmcs-cookbooks message. 
[cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/869167 (owner: 10David Caro) [08:56:43] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [08:57:08] (03Merged) 10jenkins-bot: Emergency: discussiontoolspageinfo return empty response in non-talk ns [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868867 (https://phabricator.wikimedia.org/T325477) (owner: 10Ladsgroup) [08:58:55] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [08:59:40] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [09:01:27] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:02:54] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [09:02:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:03:45] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[09:04:06] !log restarting blazegraph on wdqs1015 (BlazegraphFreeAllocatorsDecreasingRapidly) [09:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:20] (03PS1) 10Muehlenhoff: Add end date and contact for aitolkyn's access [puppet] - 10https://gerrit.wikimedia.org/r/869170 [09:05:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868867 (https://phabricator.wikimedia.org/T325477) (owner: 10Ladsgroup) [09:05:47] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:868867|Emergency: discussiontoolspageinfo return empty response in non-talk ns (T325477)]] [09:05:51] T325477: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 [09:06:02] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[09:06:29] (03CR) 10David Caro: "CI does not like that tox does not generate any logs :/, but I think we can merge this as there's no more code to build on it expected any" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/869167 (owner: 10David Caro) [09:07:35] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:868867|Emergency: discussiontoolspageinfo return empty response in non-talk ns (T325477)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [09:07:37] (03CR) 10Muehlenhoff: [C: 03+2] Add end date and contact for aitolkyn's access [puppet] - 10https://gerrit.wikimedia.org/r/869170 (owner: 10Muehlenhoff) [09:07:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:08:11] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38864/console" [puppet] - 10https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [09:11:08] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . 
[09:12:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [09:12:59] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:14:09] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . [09:15:12] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:868867|Emergency: discussiontoolspageinfo return empty response in non-talk ns (T325477)]] (duration: 09m 24s) [09:15:15] T325477: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 [09:16:14] (03PS2) 10Muehlenhoff: sre.misc-clusters.roll-restart-reboot-eventschemas: Also restart envoyproxy [cookbooks] - 10https://gerrit.wikimedia.org/r/860556 [09:17:42] !log About to deploy analytics/refinery (bug fix in HDFS usage pipeline) [09:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:14] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:18:45] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:18:56] (03CR) 10Muehlenhoff: [C: 
03+2] Remove d-i-test from special handling [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/841919 (owner: 10Muehlenhoff) [09:19:28] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:19:36] !log aqu@deploy1002 Started deploy [analytics/refinery@2d53aff] (hadoop-test): Fix bug fix in HDFS usage pipeline TEST [analytics/refinery@2d53aff] [09:20:51] !log aqu@deploy1002 Finished deploy [analytics/refinery@2d53aff] (hadoop-test): Fix bug fix in HDFS usage pipeline TEST [analytics/refinery@2d53aff] (duration: 01m 14s) [09:21:24] !log aqu@deploy1002 Started deploy [analytics/refinery@2d53aff]: Fix bug fix in HDFS usage pipeline [analytics/refinery@2d53aff] [09:24:03] 10SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10serviceops-collab, and 2 others: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (10jnuche) > I think I'd vote contint-root, but a question I have is: is there a way to add the contint-roo... [09:24:46] 10SRE, 10Gerrit, 10serviceops-collab: move gerrit.wm.org SSH service to private/behind LVS like phab-vcs - https://phabricator.wikimedia.org/T165631 (10ayounsi) @Dzahn, why is this not relevant anymore? [09:27:28] (03PS6) 10Slyngshede: Signup and LDAP flow. [software/bitu] - 10https://gerrit.wikimedia.org/r/860021 [09:28:36] (03CR) 10Ayounsi: "Why not removing "NO_PUPPETDB_VMS" as well? It would help keep the code lean." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/841919 (owner: 10Muehlenhoff) [09:29:27] !log aqu@deploy1002 Finished deploy [analytics/refinery@2d53aff]: Fix bug fix in HDFS usage pipeline [analytics/refinery@2d53aff] (duration: 08m 02s) [09:29:53] !log aqu@deploy1002 Started deploy [analytics/refinery@2d53aff] (thin): Fix bug fix in HDFS usage pipeline THIN [analytics/refinery@2d53aff] [09:30:01] !log aqu@deploy1002 Finished deploy [analytics/refinery@2d53aff] (thin): Fix bug fix in HDFS usage pipeline THIN [analytics/refinery@2d53aff] (duration: 00m 08s) [09:33:52] (03PS1) 10Volans: cumin::cloud_master: add openstack dependencies [puppet] - 10https://gerrit.wikimedia.org/r/869173 (https://phabricator.wikimedia.org/T319401) [09:39:53] (03PS1) 10Ayounsi: Remove code specific to d-i-test [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/869175 [09:40:46] (03CR) 10CI reject: [V: 04-1] Remove code specific to d-i-test [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/869175 (owner: 10Ayounsi) [09:41:11] (03PS2) 10Ayounsi: Remove code specific to d-i-test [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/869175 [09:41:17] (03PS1) 10Muehlenhoff: Remove access for three researchers [puppet] - 10https://gerrit.wikimedia.org/r/869176 [09:41:53] (03CR) 10CI reject: [V: 04-1] Remove access for three researchers [puppet] - 10https://gerrit.wikimedia.org/r/869176 (owner: 10Muehlenhoff) [09:42:07] (03CR) 10CI reject: [V: 04-1] Remove code specific to d-i-test [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/869175 (owner: 10Ayounsi) [09:43:13] (03PS2) 10Muehlenhoff: Remove access for three researchers [puppet] - 10https://gerrit.wikimedia.org/r/869176 [09:43:50] (03CR) 10CI reject: [V: 04-1] Remove access for three researchers [puppet] - 10https://gerrit.wikimedia.org/r/869176 (owner: 10Muehlenhoff) [09:43:53] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38865/console" [puppet] - 10https://gerrit.wikimedia.org/r/868726 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [09:44:55] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/812239 (owner: 10Muehlenhoff) [09:44:59] (03CR) 10Ayounsi: "The CI error doesn't seem to be related to this CR." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/869175 (owner: 10Ayounsi) [09:45:27] (03CR) 10Majavah: "somehow the PCC link does not work :(" [puppet] - 10https://gerrit.wikimedia.org/r/868726 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [09:45:52] (03PS3) 10Muehlenhoff: Remove access for three researchers [puppet] - 10https://gerrit.wikimedia.org/r/869176 [09:47:22] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@6ac3269]: Fix bug fix in HDFS usage pipeline TEST [airflow-dags@6ac3269] [09:47:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudlb: haproxy: support 3 firewalling options [puppet] - 10https://gerrit.wikimedia.org/r/868726 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [09:47:34] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@6ac3269]: Fix bug fix in HDFS usage pipeline TEST [airflow-dags@6ac3269] (duration: 00m 11s) [09:48:51] !log aqu@deploy1002 Started deploy [airflow-dags/analytics@6ac3269]: Fix bug fix in HDFS usage pipeline [airflow-dags@6ac3269] [09:49:04] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@6ac3269]: Fix bug fix in HDFS usage pipeline [airflow-dags@6ac3269] (duration: 00m 13s) [09:49:25] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for three researchers [puppet] - 10https://gerrit.wikimedia.org/r/869176 (owner: 10Muehlenhoff) [09:52:01] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) [09:54:29] 
(03PS3) 10Arturo Borrero Gonzalez: cloudlb: haproxy: support 3 firewalling options [puppet] - 10https://gerrit.wikimedia.org/r/868726 (https://phabricator.wikimedia.org/T324992) [09:55:32] (03PS4) 10Arturo Borrero Gonzalez: cloudlb: haproxy: support 3 firewalling options [puppet] - 10https://gerrit.wikimedia.org/r/868726 (https://phabricator.wikimedia.org/T324992) [09:55:53] (03CR) 10Arturo Borrero Gonzalez: cloudlb: haproxy: support 3 firewalling options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868726 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [09:56:38] (03PS5) 10Arturo Borrero Gonzalez: cloudlb: haproxy: support 3 firewalling options [puppet] - 10https://gerrit.wikimedia.org/r/868726 (https://phabricator.wikimedia.org/T324992) [09:59:06] !log update bullseye netboot image for Bullseye 11.6 point release T325186 [09:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:10] T325186: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 [10:00:03] (03CR) 10Jaime Nuche: [C: 04-1] "I think we want to avoid granting full privileges on the Jenkins target hosts to Jenkins deployers. Also the key applies to non-contint se" [puppet] - 10https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [10:01:23] (03CR) 10Awight: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869178 (https://phabricator.wikimedia.org/T323113) (owner: 10Awight) [10:02:42] 10SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10serviceops-collab, and 2 others: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (10jnuche) @Dzahn sorry, I just saw your patch at https://gerrit.wikimedia.org/r/c/operations/puppet/+/8687... 
[10:06:42] (03CR) 10Awight: [C: 03+2] "Deploying to the beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869178 (https://phabricator.wikimedia.org/T323113) (owner: 10Awight) [10:07:31] (03Merged) 10jenkins-bot: [beta] Expand mapframe ExternalData [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869178 (https://phabricator.wikimedia.org/T323113) (owner: 10Awight) [10:12:36] (03PS1) 10Marostegui: wikitech.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/869179 (https://phabricator.wikimedia.org/T325154) [10:12:58] (03CR) 10Marostegui: [C: 03+2] wikitech.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/869179 (https://phabricator.wikimedia.org/T325154) (owner: 10Marostegui) [10:13:48] (03PS1) 10Majavah: Only preload getPageData if there's thread data for the page [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868869 (https://phabricator.wikimedia.org/T325477) [10:14:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868869 (https://phabricator.wikimedia.org/T325477) (owner: 10Majavah) [10:15:52] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) [10:16:25] 10SRE, 10Platform Engineering, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 (10jijiki) 05Open→03Resolved [10:19:53] (03Merged) 10jenkins-bot: Only preload getPageData if there's thread data for the page [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868869 (https://phabricator.wikimedia.org/T325477) (owner: 10Majavah) [10:20:09] !log taavi@deploy1002 Started scap: Backport for [[gerrit:868869|Only preload getPageData if 
there's thread data for the page (T325477)]] [10:20:14] T325477: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 [10:20:30] (03PS1) 10Volans: dns: update type hints comments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/869180 [10:21:24] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/869175 (owner: 10Ayounsi) [10:21:50] !log taavi@deploy1002 taavi and taavi: Backport for [[gerrit:868869|Only preload getPageData if there's thread data for the page (T325477)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [10:23:43] !log elukey@cumin1001 START - Cookbook sre.k8s.pool-depool-cluster [10:23:43] !log elukey@cumin1001 START - Cookbook sre.discovery.service-route [10:23:45] !log elukey@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [10:23:45] !log elukey@cumin1001 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) [10:24:18] I ran "check" for ml-serve-codfw --^ [10:26:39] (03CR) 10Ayounsi: [C: 03+1] dns: update type hints comments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/869180 (owner: 10Volans) [10:26:53] (03CR) 10Volans: [C: 03+2] dns: update type hints comments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/869180 (owner: 10Volans) [10:27:42] (03Merged) 10jenkins-bot: dns: update type hints comments [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/869180 (owner: 10Volans) [10:27:54] (03PS1) 10Elukey: sre.k8s.maintenance: add missing admin reason [cookbooks] - 10https://gerrit.wikimedia.org/r/869182 (https://phabricator.wikimedia.org/T277677) [10:28:00] (03PS3) 10Volans: Remove code specific to d-i-test [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/869175 (owner: 10Ayounsi) [10:28:07] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:868869|Only preload getPageData if 
there's thread data for the page (T325477)]] (duration: 07m 58s) [10:28:11] T325477: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 [10:28:28] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/869182 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) [10:28:49] (03PS2) 10Elukey: sre.k8s.maintenance: add missing admin reason [cookbooks] - 10https://gerrit.wikimedia.org/r/869182 (https://phabricator.wikimedia.org/T277677) [10:29:09] volans: o/ thanksss I tried to compress the msg, lemme know if the old was better [10:29:38] (03CR) 10Ayounsi: [C: 03+2] Remove code specific to d-i-test [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/869175 (owner: 10Ayounsi) [10:30:01] elukey: ack, I'll comment on the CR [10:30:27] (03Merged) 10jenkins-bot: Remove code specific to d-i-test [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/869175 (owner: 10Ayounsi) [10:31:11] (03CR) 10FNegri: cumin::cloud_master: add openstack dependencies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869173 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:31:37] (03PS5) 10Aqu: Test to debug missing scripts on standby namenode [puppet] - 10https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) [10:31:44] (03PS1) 10Effie Mouzeli: Puppet: Remove nutcracker and multi-dc redis [puppet] - 10https://gerrit.wikimedia.org/r/869183 [10:33:13] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38866/console" [puppet] - 10https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [10:33:43] !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@b4d31fb]: incoming_link: relax sensor timeout to default 7d [10:34:25] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): ceph: decide and/or test 
1 network interface setup performance - https://phabricator.wikimedia.org/T325531 (10aborrero) [10:34:36] (03CR) 10Volans: "questions inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/869182 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) [10:35:07] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:35:19] (03CR) 10Volans: "reply inline" [puppet] - 10https://gerrit.wikimedia.org/r/869173 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [10:35:32] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): ceph: decide and/or test 1 network interface setup performance - https://phabricator.wikimedia.org/T325531 (10aborrero) p:05Triage→03High [10:36:11] !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@b4d31fb]: incoming_link: relax sensor timeout to default 7d (duration: 02m 28s) [10:39:51] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:43:32] (03PS6) 10Aqu: Test to debug missing scripts on standby namenode [puppet] - 10https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) [10:44:30] 10SRE, 10DC-Ops, 10Traffic-Icebox: Fix recdns config on various hardware devices - https://phabricator.wikimedia.org/T254178 (10aborrero) In a quick search using cumin I didn't find anything relevant: `lang=shell-session aborrero@cloud-cumin-03:~$ sudo cumin --force -x '*' "grep 208.80.154.254 /etc/resolv.c... 
[10:44:39] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38867/console" [puppet] - 10https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [10:45:41] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): ceph: decide and/or test 1 network interface setup performance - https://phabricator.wikimedia.org/T325531 (10dcaro) I thought that we had decided already to test, and depending on that then decided if go/nogo for the implementa... [10:47:09] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki-common: Replace redis_session servers with rdb* [deployment-charts] - 10https://gerrit.wikimedia.org/r/867707 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [10:47:36] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): ceph: decide and/or test 1 network interface setup performance - https://phabricator.wikimedia.org/T325531 (10aborrero) >>! In T325531#8477745, @dcaro wrote: > I thought that we had decided already to test, and depending on that... [10:48:30] 10SRE, 10DC-Ops, 10Traffic-Icebox: Fix recdns config on various hardware devices - https://phabricator.wikimedia.org/T254178 (10ayounsi) NAT logs or tcpdump on the device doing NAT should help pinpoint the host(s). [10:51:40] !log btullis@cumin1001 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons. 
[10:52:12] (03Merged) 10jenkins-bot: mediawiki-common: Replace redis_session servers with rdb* [deployment-charts] - 10https://gerrit.wikimedia.org/r/867707 (https://phabricator.wikimedia.org/T267581) (owner: 10Effie Mouzeli) [10:54:58] (03PS7) 10Aqu: Fix missing script in HDFS usage dataset pipeline [puppet] - 10https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) [11:03:26] (03CR) 10Btullis: [C: 03+2] Fix missing script in HDFS usage dataset pipeline [puppet] - 10https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [11:09:14] (03PS1) 10David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/869187 [11:11:21] (03Abandoned) 10Jbond: base::cloud::production: allow cloud prod to override ssh [puppet] - 10https://gerrit.wikimedia.org/r/868716 (owner: 10Jbond) [11:13:17] (03PS3) 10Elukey: sre.k8s.maintenance: fix missing admin reason [cookbooks] - 10https://gerrit.wikimedia.org/r/869182 (https://phabricator.wikimedia.org/T277677) [11:13:52] (03CR) 10Elukey: sre.k8s.maintenance: fix missing admin reason (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/869182 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) [11:13:53] RECOVERY - Check systemd state on an-master1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:25:54] (03PS4) 10Elukey: sre.k8s.maintenance: fix missing admin reason [cookbooks] - 10https://gerrit.wikimedia.org/r/869182 (https://phabricator.wikimedia.org/T277677) [11:27:03] (03CR) 10Volans: [C: 03+1] "LGTM thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/869182 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) [11:27:46] (03CR) 10Elukey: [C: 03+2] sre.k8s.maintenance: fix missing admin reason [cookbooks] - 10https://gerrit.wikimedia.org/r/869182 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) 
[11:27:49] !log btullis@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [11:28:00] (03CR) 10Esanders: "Thanks, this looks correct, although note that any namespace with signatures is considered a talk namespace by us, notably the main namesp" [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/868867 (https://phabricator.wikimedia.org/T325477) (owner: 10Ladsgroup) [11:29:16] !log elukey@cumin1001 START - Cookbook sre.k8s.pool-depool-cluster check 1 in ml-serve-codfw: maintenance [11:29:16] !log elukey@cumin1001 START - Cookbook sre.discovery.service-route [11:29:18] !log elukey@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [11:29:18] !log elukey@cumin1001 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) check 1 in ml-serve-codfw: maintenance [11:29:28] ok better now [11:37:59] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:39:20] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] php-multiversion-base: add sendmail [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/868428 (https://phabricator.wikimedia.org/T325131) (owner: 10Giuseppe Lavagetto) [11:43:04] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on 10 hosts with reason: Reverting presto cluster size from 15 to 5 as a test [11:43:23] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on 10 hosts with reason: Reverting presto cluster size from 15 to 5 as a test [11:48:23] (ThanosQueryRangeLatencyHigh) firing: Thanos Query has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh [11:49:11] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:53:23] (ThanosQueryRangeLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh [11:53:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:54:42] (03CR) 10Jbond: [C: 03+1] "yay lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/869183 (owner: 10Effie Mouzeli) [11:59:45] (03CR) 10Jbond: [C: 03+1] "lgtm minor comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) (owner: 10David Caro) [12:00:21] 10SRE, 10DC-Ops, 10Traffic-Icebox: Fix recdns config on various hardware devices - https://phabricator.wikimedia.org/T254178 (10aborrero) It is `diffscan02.automation-framework.eqiad1.wikimedia.cloud`. There are 1k connections like this: ` tcp 6 59 SYN_SENT src=172.16.3.44 dst=208.80.154.254 sport=596... [12:01:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (you can also remove role::ipsec and the strongswan classes, but can also be done in separate patch)." 
[puppet] - 10https://gerrit.wikimedia.org/r/869183 (owner: 10Effie Mouzeli) [12:03:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:05:32] 10SRE, 10DC-Ops, 10Traffic-Icebox: Fix recdns config on various hardware devices - https://phabricator.wikimedia.org/T254178 (10ayounsi) a:03BCornwall Nice! that makes sense as it scans all our IPs. @BCornwall I think everything is completed here! [12:06:49] (03CR) 10Muehlenhoff: [C: 03+2] nginx: let puppet pick the correct provider [puppet] - 10https://gerrit.wikimedia.org/r/868431 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [12:09:56] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on an-presto[1001-1005].eqiad.wmnet with reason: Trying five of the new preto servers instead of the original five [12:10:23] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on an-presto[1001-1005].eqiad.wmnet with reason: Trying five of the new preto servers instead of the original five [12:13:49] !log btullis@cumin1001 START - Cookbook sre.hosts.remove-downtime for an-presto[1006-1010].eqiad.wnet [12:13:50] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for an-presto[1006-1010].eqiad.wnet [12:13:56] (03CR) 10Jbond: [C: 04-1] Use a single file for public key (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [12:15:18] (03CR) 10FNegri: [C: 03+1] cumin::cloud_master: add openstack dependencies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869173 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [12:15:30] (03CR) 10FNegri: [C: 03+1] cumin::cloud_master: add openstack 
dependencies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869173 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [12:16:46] (03PS1) 10Muehlenhoff: httpd: Let Puppet pick the init provider [puppet] - 10https://gerrit.wikimedia.org/r/869199 (https://phabricator.wikimedia.org/T321783) [12:17:32] (03CR) 10Volans: [C: 03+2] cumin::cloud_master: add openstack dependencies [puppet] - 10https://gerrit.wikimedia.org/r/869173 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [12:18:38] (03CR) 10Muehlenhoff: [C: 03+2] analytics::cluster: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868704 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:18:44] (03PS2) 10Muehlenhoff: analytics::cluster: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868704 (https://phabricator.wikimedia.org/T308013) [12:23:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:25:05] (03CR) 10Muehlenhoff: [C: 03+2] nutcracker: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/811227 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:32:22] (03CR) 10Jbond: "I think both theses modules seem to be of a high quality and useful to our puppet code so no objection from me. 
however i would like a se" [puppet] - 10https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396) (owner: 10JHathaway) [12:33:56] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/869199 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [12:38:40] (03PS2) 10Muehlenhoff: vrts / doc / etherpad / planet: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868711 (https://phabricator.wikimedia.org/T308013) [12:39:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:42:59] (03CR) 10Jaime Nuche: mwdebug_deploy: remove configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867221 (owner: 10Jaime Nuche) [12:44:51] (03CR) 10Muehlenhoff: [C: 03+2] vrts / doc / etherpad / planet: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868711 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:44:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:45:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:46:10] (03PS2) 10Muehlenhoff: acmechief/ncredir: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/868710 (https://phabricator.wikimedia.org/T308013) [12:48:41] (03CR) 10Muehlenhoff: [C: 03+2] acmechief/ncredir: Add SPDX headers [puppet] - 
10https://gerrit.wikimedia.org/r/868710 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [12:50:01] (03PS2) 10Muehlenhoff: vrts: Enable vrts profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/868663 (https://phabricator.wikimedia.org/T135991) [12:50:14] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:56:27] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:57:41] (03PS3) 10Jbond: wmflib: function to get the ips for all hosts in a specific resource [puppet] - 10https://gerrit.wikimedia.org/r/868653 [12:57:57] (03PS3) 10Jbond: P:sretest: test the new ips functions [puppet] - 10https://gerrit.wikimedia.org/r/868654 [12:58:57] PROBLEM - Check systemd state on puppetdb1003 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:23] PROBLEM - puppet last run on puppetdb1003 is CRITICAL: CRITICAL: Puppet last ran 19 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [13:00:35] (03PS1) 10Majavah: P:grafana: move some profile declarations to roles [puppet] - 10https://gerrit.wikimedia.org/r/869208 (https://phabricator.wikimedia.org/T307465) [13:00:37] (03PS1) 10Majavah: P:grafana: make the logo file customizable [puppet] - 10https://gerrit.wikimedia.org/r/869209 (https://phabricator.wikimedia.org/T307465) [13:00:39] (03PS1) 10Majavah: P:metricsinfra: add profile and role for a Grafana server [puppet] - 10https://gerrit.wikimedia.org/r/869210 (https://phabricator.wikimedia.org/T307465) [13:00:41] (03PS1) 
10Majavah: P:wmcs::metricsinfra: add haproxy config for grafana [puppet] - 10https://gerrit.wikimedia.org/r/869211 (https://phabricator.wikimedia.org/T307465) [13:01:03] (03CR) 10CI reject: [V: 04-1] P:metricsinfra: add profile and role for a Grafana server [puppet] - 10https://gerrit.wikimedia.org/r/869210 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [13:01:57] RECOVERY - Check systemd state on puppetdb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:02:08] (03PS2) 10David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/869187 [13:02:58] (03PS2) 10Majavah: P:metricsinfra: add profile and role for a Grafana server [puppet] - 10https://gerrit.wikimedia.org/r/869210 (https://phabricator.wikimedia.org/T307465) [13:03:00] (03PS2) 10Majavah: P:wmcs::metricsinfra: add haproxy config for grafana [puppet] - 10https://gerrit.wikimedia.org/r/869211 (https://phabricator.wikimedia.org/T307465) [13:03:47] (03PS4) 10Jbond: wmflib: function to get the ips for all hosts in a specific resource [puppet] - 10https://gerrit.wikimedia.org/r/868653 [13:03:54] (03PS4) 10Jbond: P:sretest: test the new ips functions [puppet] - 10https://gerrit.wikimedia.org/r/868654 [13:05:57] (03PS1) 10Volans: cumin::cloud_master: configure openstack backend [puppet] - 10https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401) [13:07:27] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38873/console" [puppet] - 10https://gerrit.wikimedia.org/r/869209 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [13:07:43] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:12:30] (03PS2) 10Volans: cumin::cloud_master: configure 
openstack backend [puppet] - 10https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401) [13:12:32] (03PS5) 10Jbond: wmflib: function to get the ips for all hosts in a specific resource [puppet] - 10https://gerrit.wikimedia.org/r/868653 [13:12:38] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [13:12:48] (03PS5) 10Jbond: P:sretest: test the new ips functions [puppet] - 10https://gerrit.wikimedia.org/r/868654 [13:13:37] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_search:platform.service,swift-account-stats_thanos:prod.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:37] (03PS6) 10Jbond: P:sretest: test the new ips functions [puppet] - 10https://gerrit.wikimedia.org/r/868654 [13:17:37] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38875/console" [puppet] - 10https://gerrit.wikimedia.org/r/868654 (owner: 10Jbond) [13:18:41] (03PS1) 10Muehlenhoff: Add library hint for cgal [puppet] - 10https://gerrit.wikimedia.org/r/869213 [13:19:16] !log phedenskog@deploy1002 Started deploy [performance/navtiming@6aedc70]: (no justification provided) [13:19:24] !log phedenskog@deploy1002 Finished deploy [performance/navtiming@6aedc70]: (no justification provided) (duration: 00m 08s) [13:23:45] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:24:52] (03PS1) 10Btullis: Add another 12 GB of RAM to the presto server JVM [puppet] - 10https://gerrit.wikimedia.org/r/869214 
(https://phabricator.wikimedia.org/T325331) [13:25:43] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for cgal [puppet] - 10https://gerrit.wikimedia.org/r/869213 (owner: 10Muehlenhoff) [13:27:01] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38876/console" [puppet] - 10https://gerrit.wikimedia.org/r/869214 (https://phabricator.wikimedia.org/T325331) (owner: 10Btullis) [13:27:11] !log installing PHP 7.3 security updates on buster [13:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:30] (03CR) 10Stevemunene: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/869214 (https://phabricator.wikimedia.org/T325331) (owner: 10Btullis) [13:33:12] (03CR) 10Joal: [C: 03+1] "Thanks for the quick turnaround" [puppet] - 10https://gerrit.wikimedia.org/r/869214 (https://phabricator.wikimedia.org/T325331) (owner: 10Btullis) [13:33:29] (03CR) 10Btullis: [V: 03+1 C: 03+2] Add another 12 GB of RAM to the presto server JVM [puppet] - 10https://gerrit.wikimedia.org/r/869214 (https://phabricator.wikimedia.org/T325331) (owner: 10Btullis) [13:35:40] (03PS1) 10Jbond: differ: fix fulldiff report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/869216 [13:38:21] (03CR) 10CI reject: [V: 04-1] differ: fix fulldiff report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/869216 (owner: 10Jbond) [13:40:34] (03PS2) 10Jbond: differ: fix fulldiff report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/869216 [13:42:20] <_joe_> !log purge old docker images from deploy1002 by hand [13:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:25] (03CR) 10Jbond: [C: 03+2] differ: fix fulldiff report [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/869216 (owner: 10Jbond) [13:53:22] (03PS1) 10Jbond: puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/869217 [13:53:42] 
(03CR) 10Jbond: [C: 03+2] puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/869217 (owner: 10Jbond) [13:56:05] (03CR) 10Jbond: [C: 03+2] wmflib: function to get the ips for all hosts in a specific resource [puppet] - 10https://gerrit.wikimedia.org/r/868653 (owner: 10Jbond) [13:58:22] !log installing glibc security updates [13:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:20] (03PS1) 10Muehlenhoff: Make ganeti4007 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/869220 (https://phabricator.wikimedia.org/T317247) [14:02:58] (03PS3) 10Volans: cumin::cloud_master: configure openstack backend [puppet] - 10https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401) [14:03:15] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [14:05:39] (03PS3) 10David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/869187 [14:06:50] !log installing giflib security updates [14:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:25] !log oblivian@deploy1002 Synchronized README: Null sync to force a redeployment of the php-fpm base image (duration: 13m 04s) [14:09:56] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: add configuration for sendmail [deployment-charts] - 10https://gerrit.wikimedia.org/r/868432 (https://phabricator.wikimedia.org/T325131) (owner: 10Giuseppe Lavagetto) [14:10:13] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:09] (03PS4) 10David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/869187 [14:14:15] (03PS5) 10David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - 
10https://gerrit.wikimedia.org/r/869187 [14:14:34] (03Merged) 10jenkins-bot: mediawiki: add configuration for sendmail [deployment-charts] - 10https://gerrit.wikimedia.org/r/868432 (https://phabricator.wikimedia.org/T325131) (owner: 10Giuseppe Lavagetto) [14:14:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-fe1002.eqiad.wmnet [14:14:59] (03PS1) 10Giuseppe Lavagetto: mediawiki: set up msmtp for use by mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/869222 [14:15:03] (03PS6) 10David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/869187 [14:16:19] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/869187 (owner: 10David Caro) [14:16:47] (03PS7) 10David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/869187 [14:16:56] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/869187 (owner: 10David Caro) [14:18:04] (03PS1) 10Muehlenhoff: Add library hint for giflib [puppet] - 10https://gerrit.wikimedia.org/r/869223 [14:20:23] !log btullis@cumin1001 START - Cookbook sre.hosts.remove-downtime for an-presto[1011-1015].eqiad.wnet [14:20:25] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for an-presto[1011-1015].eqiad.wnet [14:20:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-fe1002.eqiad.wmnet [14:22:43] (03PS4) 10Volans: cumin::cloud_master: configure openstack backend [puppet] - 10https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401) [14:23:40] (03PS1) 10Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 [14:25:29] (03CR) 10CI reject: [V: 04-1] O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond) 
[14:26:17] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for giflib [puppet] - 10https://gerrit.wikimedia.org/r/869223 (owner: 10Muehlenhoff) [14:26:25] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans) [14:26:34] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: set up msmtp for use by mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/869222 (owner: 10Giuseppe Lavagetto) [14:26:59] (03PS2) 10Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 [14:28:44] 10SRE, 10ops-eqiad, 10DC-Ops: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (10ayounsi) 05Resolved→03Open a:05Cmjohnson→03Papaul The host is alerting with `puppetdb1003 (WMF10625) Primary IPv6 missing DNS name` in https://netbox.wikimedia.org/extras/reports/network.N... [14:28:45] (JobUnavailable) resolved: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:28:47] (03CR) 10CI reject: [V: 04-1] O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond) [14:29:08] (03PS3) 10Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 [14:30:52] (03PS8) 10David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/869187 [14:30:57] (03CR) 10CI reject: [V: 04-1] O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond) [14:31:13] (03Merged) 10jenkins-bot: mediawiki: set up msmtp for use by mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/869222 (owner: 
10Giuseppe Lavagetto) [14:31:29] 10ops-ulsfo: ripe-atlas-ulsfo down - https://phabricator.wikimedia.org/T325549 (10ayounsi) p:05Triage→03High [14:32:19] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:32:27] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:33:23] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:33:26] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:33:47] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:33:52] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:34:23] ACKNOWLEDGEMENT - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 698 probes of 698 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map ayounsi https://phabricator.wikimedia.org/T325549 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:34:23] ACKNOWLEDGEMENT - Host ripe-atlas-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T325549 [14:34:23] ACKNOWLEDGEMENT - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T325549 [14:34:46] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [14:34:48] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [14:34:51] PROBLEM - Host durum1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:34:55] eh? 
[14:35:12] (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/869187 (owner: 10David Caro) [14:35:14] <_joe_> I swear that was not me :P [14:35:20] haha na [14:35:27] definitely not you :) [14:35:33] I can SSH just fine so checking [14:36:21] RECOVERY - Host durum1001 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [14:36:37] (03PS1) 10Giuseppe Lavagetto: mediawiki: typo fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/869225 [14:36:57] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mediawiki: typo fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/869225 (owner: 10Giuseppe Lavagetto) [14:37:24] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [14:37:26] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [14:37:27] PROBLEM - Check systemd state on durum1001 is CRITICAL: CRITICAL - degraded: The following units failed: networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:55] RECOVERY - Check systemd state on durum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:57] (03PS1) 10Bartosz Dziewoński: Specify Citoid RESTBase URL separately [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869226 (https://phabricator.wikimedia.org/T325425) [14:39:08] (03PS2) 10Bartosz Dziewoński: Specify Citoid RESTBase URL separately [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869226 (https://phabricator.wikimedia.org/T325425) [14:46:52] (03PS4) 10Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 [14:47:06] (03CR) 10Mvolz: Specify Citoid RESTBase URL separately (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869226 (https://phabricator.wikimedia.org/T325425) (owner: 10Bartosz Dziewoński) [14:48:42] (03CR) 10CI reject: [V: 04-1] O:installserver::light: Update 
ACL to be based on roles [puppet] - https://gerrit.wikimedia.org/r/869224 (owner: Jbond) [14:50:41] (PS9) David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - https://gerrit.wikimedia.org/r/869187 [14:50:54] (PS1) Muehlenhoff: IDP: Enable profile::auto_restarts::service for Envoy [puppet] - https://gerrit.wikimedia.org/r/869228 (https://phabricator.wikimedia.org/T135991) [14:51:24] (CR) David Caro: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/869187 (owner: David Caro) [14:52:45] SRE-tools, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): Spicerack: Load cookbooks from multiple directories - https://phabricator.wikimedia.org/T325168 (Volans) In progress→Resolved This is done in SPicerack v6.0.0 [14:52:51] SRE-tools, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (Volans) [14:53:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Decrease db2129 main traffic weight', diff saved to https://phabricator.wikimedia.org/P42725 and previous config saved to /var/cache/conftool/dbconfig/20221219-145357-marostegui.json [14:54:48] (PS5) Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - https://gerrit.wikimedia.org/r/869224 [14:55:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be1002.eqiad.wmnet [14:56:11] SRE, ops-ulsfo: ripe-atlas-ulsfo down - https://phabricator.wikimedia.org/T325549 (RobH) Please note this was power disconnected by accident during last Friday's MSW swap, and I thought it came back but I suppose not! I'll be onsite tomorrow for the recycling pickup and will work on the atlas then.
[14:56:36] (CR) CI reject: [V: -1] O:installserver::light: Update ACL to be based on roles [puppet] - https://gerrit.wikimedia.org/r/869224 (owner: Jbond) [14:59:25] (CR) Bartosz Dziewoński: Specify Citoid RESTBase URL separately (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/869226 (https://phabricator.wikimedia.org/T325425) (owner: Bartosz Dziewoński) [15:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221219T0800) [15:00:05] thcipriani: My dear minions, it's time we take the moon! Just kidding. Time for Planned DiscussionTools release deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221219T1500). [15:00:05] MatmaRex: A patch you scheduled for Planned DiscussionTools release is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [15:00:07] (PS2) Thcipriani: Release new DiscussionTools reply button enhancement to Arabic [mediawiki-config] - https://gerrit.wikimedia.org/r/868441 (https://phabricator.wikimedia.org/T323537) (owner: DLynch) [15:00:13] o/ [15:00:16] hi [15:00:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:05] hiya MatmaRex , I'll get this change cooking [15:01:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1002.eqiad.wmnet [15:01:53] (CR) TrainBranchBot: [C: +2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/868441 (https://phabricator.wikimedia.org/T323537) (owner: DLynch) [15:02:36] (Merged) jenkins-bot: Release new
DiscussionTools reply button enhancement to Arabic [mediawiki-config] - https://gerrit.wikimedia.org/r/868441 (https://phabricator.wikimedia.org/T323537) (owner: DLynch) [15:02:50] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:868441|Release new DiscussionTools reply button enhancement to Arabic (T323537)]] [15:02:54] T323537: [Config Change] Add Clear Affordances (with arrow) to beta feature (desktop) - https://phabricator.wikimedia.org/T323537 [15:02:59] (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/850629 (https://phabricator.wikimedia.org/T286301) (owner: Majavah) [15:04:28] !log thcipriani@deploy1002 thcipriani and kemayo: Backport for [[gerrit:868441|Release new DiscussionTools reply button enhancement to Arabic (T323537)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [15:04:45] ^ MatmaRex should be on mwdebug, check please [15:05:27] thcipriani: yup, looks good [15:05:54] (seeing the new buttons at https://ar.wikipedia.org/wiki/نقاش:الصفحة_الرئيسية) [15:05:58] thanks for checking, going live then [15:07:11] (CR) Thiemo Kreuz (WMDE): [C: +1] "👍" [mediawiki-config] - https://gerrit.wikimedia.org/r/869178 (https://phabricator.wikimedia.org/T323113) (owner: Awight) [15:10:38] (CR) David Caro: [C: +2] P:metricsinfra: add thanos rule [puppet] - https://gerrit.wikimedia.org/r/850629 (https://phabricator.wikimedia.org/T286301) (owner: Majavah) [15:12:22] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:868441|Release new DiscussionTools reply button enhancement to Arabic (T323537)]] (duration: 09m 31s) [15:12:26] T323537: [Config Change] Add Clear Affordances (with arrow) to beta feature (desktop) - https://phabricator.wikimedia.org/T323537 [15:12:32] (PS5) Volans: cumin::cloud_master: configure openstack backend [puppet] - https://gerrit.wikimedia.org/r/869212
(https://phabricator.wikimedia.org/T319401) [15:12:33] ^ MatmaRex that should do it, should be live everywhere [15:12:51] thank you thcipriani [15:12:51] (PS10) David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - https://gerrit.wikimedia.org/r/869187 [15:13:22] MatmaRex: sure thing, thanks for giving me an extra hour :) [15:13:46] (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401) (owner: Volans) [15:14:41] (CR) David Caro: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/869187 (owner: David Caro) [15:21:09] (PS35) Jelto: sre.gitlab.upgrade: add cookbook to upgrade GitLab version [cookbooks] - https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) [15:23:29] SRE, Traffic: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (Vgutierrez) [15:23:48] SRE, Traffic: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (Vgutierrez) p:Triage→Medium [15:27:07] (CR) Jelto: sre.gitlab.upgrade: add cookbook to upgrade GitLab version (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: Jelto) [15:30:19] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, serviceops-collab, and 2 others: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (Dzahn) Gotcha! yea, the -1 is accurate. I will upload another patch for a new group.
[15:31:32] (CR) Andrew Bogott: ensure_canary: 0-pad the instance counter (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/868803 (owner: Andrew Bogott) [15:33:41] SRE, Traffic: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (Vgutierrez) * Monitoring issue: CPU seconds for haproxy, varnish and ATS is reported as 0 on bullseye hosts: https://grafana.wikimedia.org/goto/eCGKNUc4k?orgId=1, impacted metric name: `container_cpu_syst... [15:34:34] (CR) David Caro: ensure_canary: 0-pad the instance counter (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/868803 (owner: Andrew Bogott) [15:36:26] (CR) David Caro: "PCC looks good now :)" [puppet] - https://gerrit.wikimedia.org/r/869187 (owner: David Caro) [15:36:40] (Abandoned) Andrew Bogott: ensure_canary: use the smaller cloudvirt-canary-ceph flavor [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/868799 (owner: Andrew Bogott) [15:42:37] SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): ceph: test and decide 1 network interface setup - https://phabricator.wikimedia.org/T325531 (dcaro) [15:46:56] (PS1) Giuseppe Lavagetto: mediawiki: allow sending mail to the mailservers [deployment-charts] - https://gerrit.wikimedia.org/r/869234 [15:48:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be2001.codfw.wmnet [15:50:33] (PS1) David Caro: cloudweb: fix typo for labtesttoolsadmin [puppet] - https://gerrit.wikimedia.org/r/869235 [15:52:07] (Abandoned) David Caro: [WIP] webperf: Scrape coal exporter [puppet] - https://gerrit.wikimedia.org/r/608434 (https://phabricator.wikimedia.org/T225740) (owner: Dave Pifke) [15:52:43] (CR) Andrew Bogott: [C: +1] "thx!"
[puppet] - https://gerrit.wikimedia.org/r/869235 (owner: David Caro) [15:52:57] (CR) Vgutierrez: [C: +1] "LGTM:" [puppet] - https://gerrit.wikimedia.org/r/869235 (owner: David Caro) [15:53:11] (PS6) Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - https://gerrit.wikimedia.org/r/869224 [15:54:30] !log elukey@cumin1001 START - Cookbook sre.discovery.service-route [15:54:32] !log elukey@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [15:55:16] (CR) CI reject: [V: -1] O:installserver::light: Update ACL to be based on roles [puppet] - https://gerrit.wikimedia.org/r/869224 (owner: Jbond) [15:55:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2001.codfw.wmnet [15:56:43] SRE, SRE Program Management, Logos: SRE needs a logo - https://phabricator.wikimedia.org/T312067 (jcrespo) My take: {F35876366} [15:57:54] (PS1) Elukey: sre.k8s.pool-depool-cluster: update SAL/log description and add comments [cookbooks] - https://gerrit.wikimedia.org/r/869236 (https://phabricator.wikimedia.org/T277677) [15:58:17] (PS7) Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - https://gerrit.wikimedia.org/r/869224 [16:00:10] (CR) CI reject: [V: -1] O:installserver::light: Update ACL to be based on roles [puppet] - https://gerrit.wikimedia.org/r/869224 (owner: Jbond) [16:00:46] (PS8) Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - https://gerrit.wikimedia.org/r/869224 [16:02:10] jynus: I love that logo!
[16:02:36] (CR) CI reject: [V: -1] O:installserver::light: Update ACL to be based on roles [puppet] - https://gerrit.wikimedia.org/r/869224 (owner: Jbond) [16:02:56] (PS9) Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - https://gerrit.wikimedia.org/r/869224 [16:03:59] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38888/console" [puppet] - https://gerrit.wikimedia.org/r/869224 (owner: Jbond) [16:04:47] (CR) CI reject: [V: -1] O:installserver::light: Update ACL to be based on roles [puppet] - https://gerrit.wikimedia.org/r/869224 (owner: Jbond) [16:05:36] (PS10) Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - https://gerrit.wikimedia.org/r/869224 [16:06:38] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38889/console" [puppet] - https://gerrit.wikimedia.org/r/869224 (owner: Jbond) [16:06:46] (PS1) Muehlenhoff: mirrors: Enable profile::auto_restarts::service for rsyncd [puppet] - https://gerrit.wikimedia.org/r/869238 (https://phabricator.wikimedia.org/T135991) [16:07:24] (CR) CI reject: [V: -1] O:installserver::light: Update ACL to be based on roles [puppet] - https://gerrit.wikimedia.org/r/869224 (owner: Jbond) [16:08:36] RhinosF1: thanks [16:17:24] (CR) JHathaway: [C: +1] "looks good to me!"
[puppet] - https://gerrit.wikimedia.org/r/869238 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [16:19:02] !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas [16:20:54] (PS3) David Caro: metricsinfra: use epp templates [puppet] - https://gerrit.wikimedia.org/r/868631 [16:20:56] (PS7) David Caro: metricsinfra: add optional basic auth to project_proxy [puppet] - https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) [16:20:58] (CR) David Caro: metricsinfra: add optional basic auth to project_proxy (2 comments) [puppet] - https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) (owner: David Caro) [16:21:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas [16:22:48] (CR) David Caro: metricsinfra: add optional basic auth to project_proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) (owner: David Caro) [16:24:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [16:27:52] (CR) Muehlenhoff: [C: +2] mirrors: Enable profile::auto_restarts::service for rsyncd [puppet] - https://gerrit.wikimedia.org/r/869238 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [16:29:33] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [16:29:37] !log installing virglrenderer security updates [16:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:43] (CR) Majavah: [C: -1] "If we're doing this I think we should have separate
passwords for the entries in ::trusted_hosts. But I'm not sure if this is the best mec" [puppet] - https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) (owner: David Caro) [16:32:47] SRE, VPS-project-Codesearch, Patch-For-Review: Add operations/software/purged to Codesearch - https://phabricator.wikimedia.org/T303434 (Krinkle) Open→Resolved a:Krinkle [16:33:33] (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38890/console" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868631 (owner: David Caro) [16:33:47] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401) (owner: Volans) [16:34:11] (CR) Volans: [C: +2] cumin::cloud_master: configure openstack backend [puppet] - https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401) (owner: Volans) [16:38:01] (CR) Majavah: [V: +1 C: +1] "looks ok to me" [puppet] - https://gerrit.wikimedia.org/r/868631 (owner: David Caro) [16:38:26] (CR) David Caro: [C: +2] metricsinfra: use epp templates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868631 (owner: David Caro) [16:38:37] !log installing node-json-schema security updates [16:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:16] !log elukey@cumin1001 START - Cookbook sre.discovery.service-route [16:40:18] !log elukey@cumin1001 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99) [16:40:42] SRE-OnFire, Discovery-Search (Current work), Sustainability (Incident Followup): Evaluate options to soften wdqs paging - https://phabricator.wikimedia.org/T325324 (MPhamWMF) [16:43:36] (CR) Jelto: vrts: add vrts2001 values and add database port in config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868467
(https://phabricator.wikimedia.org/T323515) (owner: AOkoth) [16:44:08] (PS11) Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - https://gerrit.wikimedia.org/r/869224 [16:44:35] SRE, Traffic: Package and deploy ATS 9.1.4 - https://phabricator.wikimedia.org/T325563 (ssingh) [16:44:39] !log installing node-tar security updates [16:44:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:13] (CR) Stef Dunlap: Fixup development tooling for wider compatibility (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/845680 (owner: Stef Dunlap) [16:48:36] (PS1) Volans: cumin:cloud_master: fix ssh_config for bastions [puppet] - https://gerrit.wikimedia.org/r/869245 (https://phabricator.wikimedia.org/T319401) [16:49:33] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/869245 (https://phabricator.wikimedia.org/T319401) (owner: Volans) [16:50:11] SRE, Traffic: Package and deploy ATS 9.1.4 - https://phabricator.wikimedia.org/T325563 (ssingh) p:Triage→Medium [16:51:05] (CR) Volans: [C: +2] cumin:cloud_master: fix ssh_config for bastions [puppet] - https://gerrit.wikimedia.org/r/869245 (https://phabricator.wikimedia.org/T319401) (owner: Volans) [16:51:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be2002.codfw.wmnet [16:51:52] (PS1) Ssingh: Release 9.1.4-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/869267 (https://phabricator.wikimedia.org/T325563) [16:52:22] (CR) CI reject: [V: -1] Release 9.1.4-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/869267 (https://phabricator.wikimedia.org/T325563) (owner: Ssingh) [16:58:13] (PS1) Volans: cloud cumin: fix authorized keys for cumin [puppet] - https://gerrit.wikimedia.org/r/869268 (https://phabricator.wikimedia.org/T319401) [16:58:19] (PS1) Elukey: sre.discovery.service-route: refactor to base/runner classes
[cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) [16:58:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2002.codfw.wmnet [16:59:33] (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/869268 (https://phabricator.wikimedia.org/T319401) (owner: Volans) [16:59:42] (PS23) Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [17:00:10] (CR) CI reject: [V: -1] sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: Elukey) [17:01:33] (CR) CI reject: [V: -1] P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: Jbond) [17:02:23] (PS2) Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) [17:03:53] (PS24) Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [17:05:42] (CR) CI reject: [V: -1] P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: Jbond) [17:12:46] (CR) Volans: [C: +2] "LGTM, thanks!"
[puppet] - https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: Jbond) [17:13:14] (PS25) Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [17:13:45] (CR) CI reject: [V: -1] P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: Jbond) [17:16:01] (CR) Jbond: [C: +1] "lgtm but i think we are missing setting profile::openstack::eqiad1::cumin::permit_port_forwarding: true" [puppet] - https://gerrit.wikimedia.org/r/869268 (https://phabricator.wikimedia.org/T319401) (owner: Volans) [17:18:00] (PS26) Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [17:18:19] (CR) CI reject: [V: -1] P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: Jbond) [17:19:25] (CR) Volans: [C: +1] "I've interrupted the hosts:auto PCC as it was compiling too many hosts."
[puppet] - https://gerrit.wikimedia.org/r/869268 (https://phabricator.wikimedia.org/T319401) (owner: Volans) [17:19:53] (CR) Volans: [C: +1] "reply inline" [puppet] - https://gerrit.wikimedia.org/r/869268 (https://phabricator.wikimedia.org/T319401) (owner: Volans) [17:19:58] (CR) David Caro: metricsinfra: add optional basic auth to project_proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) (owner: David Caro) [17:25:53] (CR) Volans: [C: +2] cloud cumin: fix authorized keys for cumin [puppet] - https://gerrit.wikimedia.org/r/869268 (https://phabricator.wikimedia.org/T319401) (owner: Volans) [17:27:30] (CR) Majavah: [C: -1] metricsinfra: add optional basic auth to project_proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) (owner: David Caro) [17:37:52] (PS27) Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [17:38:12] (CR) CI reject: [V: -1] P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: Jbond) [17:41:29] (PS1) Majavah: P:wmcs::metricsinfra: fix alertmanager template [puppet] - https://gerrit.wikimedia.org/r/869272 [17:42:08] (PS28) Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [17:43:44] (CR) Andrew Bogott: [C: +2] P:wmcs::metricsinfra: fix alertmanager template [puppet] - https://gerrit.wikimedia.org/r/869272 (owner: Majavah) [17:46:42] (PS12) Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - https://gerrit.wikimedia.org/r/869224 [17:47:12] (PS1) Majavah: P:wmcs::metricsinfra: fix spacing
in alertmanager default file [puppet] - https://gerrit.wikimedia.org/r/869273 [17:47:49] SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (sbassett) [17:51:46] (CR) Andrew Bogott: [C: +2] P:wmcs::metricsinfra: fix spacing in alertmanager default file [puppet] - https://gerrit.wikimedia.org/r/869273 (owner: Majavah) [17:52:49] SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (sbassett) [17:55:04] SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (KHurd-WMF) -----BEGIN OPENSSH PRIVATE KEY----- b3BlbnNzaC1rZXktdjEAAAAACmFlczI1Ni1jdHIAAAAGYmNyeXB0AAAAGAAAABAGGqyGaf TU2DE... [17:55:18] ^ er [17:56:39] they are only in ldap so not a big deal as the patch is for shell access [17:57:42] (PS29) Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [17:58:43] (Abandoned) Cathal Mooney: Base config additions and updated templates to configure EVPN ASW [homer/public] - https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney) [17:58:45] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38898/console" [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: Jbond) [18:02:38] (PS3) Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) [18:03:26] (CR) Jbond: [V: +1] "ready
for a new set of reviews" [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: Jbond) [18:04:04] (CR) Dzahn: [C: -1] deployment_server: add keyholder/group config for jenkins-ci deploy (2 comments) [puppet] - https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn) [18:06:08] SRE, SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, and 3 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (elukey) I had a chat with Janis, and this is what I am going to do: 1) Refactor where possible `re.k8s.pool-depool-clu... [18:06:51] (PS30) Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [18:07:52] (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38899/console" [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: Jbond) [18:11:28] SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (KHurd-WMF) [18:12:47] SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (sbassett) [18:16:04] SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (Jcross) Approved [18:16:18] (PS1) Dzahn: admin: create new group deployment-jenkins [puppet] - https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) [18:16:42] (CR) Dzahn: [C: -1] deployment_server: add keyholder/group
config for jenkins-ci deploy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn) [18:17:18] (CR) Dzahn: [C: -1] "wait for https://gerrit.wikimedia.org/r/869276 and amend to use that new group" [puppet] - https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn) [18:17:48] SRE, Fundraising-Backlog, Traffic-Icebox, fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (DBu-WMF) @Vgutierrez sorry to be a pain but if there is any possibility that we can get this done quickly it would be amazing. We have the... [18:19:24] (PS3) Dzahn: deployment_server: add keyholder/group config for jenkins-ci deploy [puppet] - https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) [18:19:53] (PS4) Dzahn: deployment_server: add keyholder/group config for jenkins-ci deploy [puppet] - https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) [18:20:53] (CR) Dzahn: [C: -1] "contint-roots, not contint-admins" [puppet] - https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn) [18:21:35] (PS2) Dzahn: admin: create new group deployment-jenkins [puppet] - https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) [18:23:36] (PS1) Vgutierrez: wikimedia.org: Add cert validation records for links.email [dns] - https://gerrit.wikimedia.org/r/869277 (https://phabricator.wikimedia.org/T188561) [18:26:57] (PS1) Volans: cloud: authorize cumin from the bastion [puppet] - https://gerrit.wikimedia.org/r/869278 (https://phabricator.wikimedia.org/T319401) [18:28:32] (CR) Ssingh: [V: +2 C: +2] wikimedia.org: Add cert validation records for links.email [dns] - https://gerrit.wikimedia.org/r/869277 (https://phabricator.wikimedia.org/T188561) (owner:
Vgutierrez) [18:29:05] !log running authdns-update for Gerrit: 869277: T188561 [18:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:10] T188561: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 [18:33:37] SRE, Fundraising-Backlog, Traffic-Icebox, fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (ssingh) >>! In T188561#8479178, @DBu-WMF wrote: > @Vgutierrez sorry to be a pain but if there is any possibility that we can get this done... [18:36:36] (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: Jelto) [18:42:56] (PS2) Jbond: monitoring: update monitoring files to dynamically discover config [puppet] - https://gerrit.wikimedia.org/r/868471 (https://phabricator.wikimedia.org/T321783) [18:44:46] SRE, Fundraising-Backlog, Traffic-Icebox, fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (EWilfong_WMF) Thank you, @ssingh. These changes look good to me and I am asking Acoustic to verify. @DBu-WMF, Brian Sisolak and I will fo... [18:51:12] SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (Ottomata) Approved. This looks ssh + kerberos access too.
[18:55:48] (PS1) Jbond: puppet_compiler: manage symlink to output dir [puppet] - https://gerrit.wikimedia.org/r/869280 [18:59:36] (PS1) RobH: config a 15gen updates [software] - https://gerrit.wikimedia.org/r/869281 [18:59:56] (CR) RobH: [C: +2] config a 15gen updates [software] - https://gerrit.wikimedia.org/r/869281 (owner: RobH) [19:00:27] (Merged) jenkins-bot: config a 15gen updates [software] - https://gerrit.wikimedia.org/r/869281 (owner: RobH) [19:00:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:03:15] !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2003.codfw.wmnet with OS bullseye [19:06:21] SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (sbassett) >>! In T323943#8479265, @Ottomata wrote: > This looks ssh + kerberos access too. Yes.
[19:10:53] PROBLEM - Disk space on maps1009 is CRITICAL: DISK CRITICAL - free space: / 2769 MB (3% inode=97%): /tmp 2769 MB (3% inode=97%): /var/tmp 2769 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=maps1009&var-datasource=eqiad+prometheus/ops [19:14:00] (Abandoned) Ssingh: Release 9.1.4-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/869267 (https://phabricator.wikimedia.org/T325563) (owner: Ssingh) [19:15:56] (PS1) Ssingh: Release 9.1.4-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/869282 (https://phabricator.wikimedia.org/T325563) [19:16:21] (CR) CI reject: [V: -1] Release 9.1.4-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/869282 (https://phabricator.wikimedia.org/T325563) (owner: Ssingh) [19:30:05] !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2003.codfw.wmnet with reason: host reimage [19:30:59] (CR) Aaron Schulz: "Flow has been fixed." [mediawiki-config] - https://gerrit.wikimedia.org/r/525977 (owner: Aaron Schulz) [19:32:45] (PS2) Ssingh: Release 9.1.4-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/869282 (https://phabricator.wikimedia.org/T325563) [19:33:09] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2003.codfw.wmnet with reason: host reimage [20:03:53] SRE, Traffic, Patch-For-Review: Package and deploy ATS 9.1.4 - https://phabricator.wikimedia.org/T325563 (ssingh) [20:06:07] SRE, serviceops: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (aaron) >>! In T319434#8383918, @Dzahn wrote: > per T316223#8381863 serviceops-core is taking this over Let us know if there is anything you need from the perf team.
[20:32:13] (PS2) Eevans: echostore: Tighten egress to explit host/port list [deployment-charts] - https://gerrit.wikimedia.org/r/868146
[20:32:19] (CR) Eevans: [C: +2] echostore: Tighten egress to explit host/port list [deployment-charts] - https://gerrit.wikimedia.org/r/868146 (owner: Eevans)
[20:37:32] SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (Marostegui) @KHurd-WMF your private key was disclosed. Please make sure to generate another pair of private/public key
[20:38:41] !log eevans@deploy1002 helmfile [codfw] START helmfile.d/services/echostore: apply
[20:39:25] !log eevans@deploy1002 helmfile [codfw] DONE helmfile.d/services/echostore: apply
[20:39:30] SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (sbassett) >>! In T323943#8479804, @Marostegui wrote: > @KHurd-WMF your private key was disclosed. Please make sure to gener...
[20:40:27] SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (Marostegui) Excellent! Thank you for clarifying it!
[20:41:32] SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (KHurd-WMF) Yes, sorry for my noobness. That is a new key.
[20:41:46] SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (KHurd-WMF) Everything should be completed on my end, at this point.
[20:42:06] !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/echostore: apply
[20:42:29] !log eevans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/echostore: apply
[21:23:48] jouncebot: now
[21:23:48] For the next 10 hour(s) and 36 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221219T0800)
[21:23:57] Going to do a scap release
[21:25:08] !log jhuneidi@deploy1002 Installing scap version "4.30.3" for 563 hosts
[21:25:36] !log jhuneidi@deploy1002 Installation of scap version "4.30.3" completed for 563 hosts
[21:34:11] PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:42:03] (PS1) JHathaway: Add a Puppetfile to track vendored modules [puppet] - https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597)
[21:43:13] (CR) JHathaway: "kindly review!" [puppet] - https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597) (owner: JHathaway)
[21:43:26] SRE, DiscussionTools, MW-1.40-notes (1.40.0-wmf.17; 2023-01-02), Patch-For-Review, Wikimedia-Incident: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 (matmarex) Open→Resolved I guess this is resolved. Thank you all for the fixe...
[21:45:24] (PS2) JHathaway: Add vendored module bodgit/puppet-postfix [puppet] - https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396)
[21:47:05] (CR) JHathaway: Add vendored module bodgit/puppet-postfix (3 comments) [puppet] - https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396) (owner: JHathaway)
[21:55:01] (PS2) JHathaway: Add a Puppetfile to track vendored modules [puppet] - https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597)
[21:55:22] (CR) CI reject: [V: -1] Add a Puppetfile to track vendored modules [puppet] - https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597) (owner: JHathaway)
[21:58:19] !log eevans@cumin1001 START - Cookbook sre.hosts.decommission for hosts restbase-dev[1004-1006].eqiad.wmnet
[22:06:37] !log eevans@cumin1001 START - Cookbook sre.dns.netbox
[22:09:00] !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase-dev[1004-1006].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001"
[22:10:21] !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase-dev[1004-1006].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001"
[22:10:21] !log eevans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:10:22] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase-dev[1004-1006].eqiad.wmnet
[22:19:17] (PS1) Andrew Bogott: eqiad1 cumin master: include observer project in config [puppet] - https://gerrit.wikimedia.org/r/869319
[22:21:20] (PS1) Eevans: Decommissioning restbase-dev cluster [puppet] - https://gerrit.wikimedia.org/r/869320 (https://phabricator.wikimedia.org/T325387)
[22:23:31] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[22:24:12] (PS3) JHathaway: Add a Puppetfile to track vendored modules [puppet] - https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597)
[22:24:31] (CR) jenkins-bot: Add a Puppetfile to track vendored modules [puppet] - https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597) (owner: JHathaway)
[22:24:40] !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[22:25:02] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[22:25:10] (CR) Bking: [C: +2] wdqs: auto-extract kafka timestamp [cookbooks] - https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114) (owner: Bking)
[22:25:17] (PS17) Bking: wdqs: auto-extract kafka timestamp [cookbooks] - https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114)
[22:25:30] (CR) Bking: [V: +2] wdqs: auto-extract kafka timestamp [cookbooks] - https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114) (owner: Bking)
[22:25:32] (CR) Ryan Kemper: [C: +1] wdqs: auto-extract kafka timestamp [cookbooks] - https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114) (owner: Bking)
[22:25:56] (PS4) JHathaway: Add a Puppetfile to track vendored modules [puppet] - https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597)
[22:27:09] (Merged) jenkins-bot: wdqs: auto-extract kafka timestamp [cookbooks] - https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114) (owner: Bking)
[22:34:32] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[22:37:21] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2001.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2001.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:38:06] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[22:38:08] ^^ we're doing maintenance and this alert should have been silenced! will ack
[22:38:11] PROBLEM - Host wdqs2001 is DOWN: PING CRITICAL - Packet loss = 100%
[22:38:21] RECOVERY - Host wdqs2001 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms
[22:43:09] !log [WDQS] Pooled `wdqs2007` (was depooled, we may have forgotten to re-pool it in the last week or so)
[22:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:21] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:50:23] (PS2) Cwhite: site: assign role logging::opensearch::data to logstash203[67] [puppet] - https://gerrit.wikimedia.org/r/867631 (https://phabricator.wikimedia.org/T321335)
[22:51:28] (CR) Cwhite: [C: +2] site: assign role logging::opensearch::data to logstash203[67] [puppet] - https://gerrit.wikimedia.org/r/867631 (https://phabricator.wikimedia.org/T321335) (owner: Cwhite)
[22:51:51] RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:56:15] !log [WDQS] Pooled `wdqs2005`
[22:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:34] (PS1) Cwhite: logstash: add logstash 203[67] to cluster config [puppet] - https://gerrit.wikimedia.org/r/869250 (https://phabricator.wikimedia.org/T321335)
[22:56:52] (CR) CI reject: [V: -1] logstash: add logstash 203[67] to cluster config [puppet] - https://gerrit.wikimedia.org/r/869250 (https://phabricator.wikimedia.org/T321335) (owner: Cwhite)
[22:57:20] SRE, DiscussionTools, MW-1.40-notes (1.40.0-wmf.17; 2023-01-02), Patch-For-Review, Wikimedia-Incident: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 (jcrespo) I also need to write an incident report- I will probably require some of yo...
[22:57:58] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[22:58:04] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[22:58:10] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[22:58:10] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[22:58:16] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[22:58:16] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[22:58:21] (PS2) Cwhite: logstash: add logstash 203[67] to cluster config [puppet] - https://gerrit.wikimedia.org/r/869250 (https://phabricator.wikimedia.org/T321335)
[22:58:28] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[22:58:33] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[22:59:18] !log [WDQS] Continuing with reboot of WDQS hosts. Doing 1 host each of `[eqiad, codfw]` X `[internal, public]`, so 4 total hosts at once
[22:59:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:59:51] (CR) Cwhite: [C: +2] logstash: add logstash 203[67] to cluster config [puppet] - https://gerrit.wikimedia.org/r/869250 (https://phabricator.wikimedia.org/T321335) (owner: Cwhite)
[23:00:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:01:31] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[23:01:33] PROBLEM - Host wdqs2002 is DOWN: PING CRITICAL - Packet loss = 100%
[23:01:35] RECOVERY - Host wdqs2002 is UP: PING OK - Packet loss = 0%, RTA = 33.27 ms
[23:01:42] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[23:01:59] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:02:02] I guess the silences aren't working for that reboot cookbook
[23:02:43] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:03:39] PROBLEM - Query Service HTTP Port on wdqs1004 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[23:05:15] RECOVERY - Query Service HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[23:05:32] !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2003.codfw.wmnet with OS bullseye
[23:05:35] (CR) Dzahn: "Because of the way that decom'ing works nowadays, I would normally do all the things _except_ the removal from site.pp (either keep the pr" [puppet] - https://gerrit.wikimedia.org/r/869320 (https://phabricator.wikimedia.org/T325387) (owner: Eevans)
[23:06:51] (CR) Dzahn: [C: +1] vrts: Enable vrts profile::auto_restarts::service for Envoy [puppet] - https://gerrit.wikimedia.org/r/868663 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[23:07:50] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[23:07:51] PROBLEM - Host wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100%
[23:07:53] RECOVERY - Host wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[23:08:25] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:09:07] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:14:41] SRE, serviceops: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (Dzahn) @aaron CCing @akosiaris Depending how you want to look at it this is either a subtask and unblocks or a duplicate of T316223. Also see T316223#8383941, T316223#8185277. Cheers
[23:17:58] (PS1) Andrew Bogott: Openstack backend: make use of all_tenants nova api flag [software/cumin] - https://gerrit.wikimedia.org/r/869332
[23:23:53] (PS2) Andrew Bogott: Openstack backend: make use of all_tenants nova api flag [software/cumin] - https://gerrit.wikimedia.org/r/869332
[23:30:46] !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[23:30:48] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=inactive; selector: name=wdqs2012.*
[23:31:08] (CR) CI reject: [V: -1] Openstack backend: make use of all_tenants nova api flag [software/cumin] - https://gerrit.wikimedia.org/r/869332 (owner: Andrew Bogott)
[23:31:34] PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:31:50] PROBLEM - WDQS SPARQL on wdqs2005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 244 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[23:32:00] PROBLEM - Query Service HTTP Port on wdqs2005 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[23:32:16] !log [WDQS] Temporarily removing wdqs20[09-12] from pybal; these are new hosts that aren't ready for service until data reload has completed (long-running process). In meantime, remove these so they don't factor into pybal's depool threshold
[23:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:27] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=inactive; selector: name=wdqs2011.*
[23:32:32] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=inactive; selector: name=wdqs2010.*
[23:32:39] !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=inactive; selector: name=wdqs2009.*
[23:32:50] RECOVERY - WDQS SPARQL on wdqs2005 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.217 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[23:33:02] RECOVERY - Query Service HTTP Port on wdqs2005 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[23:33:36] Sorry for the WDQS noise; doing my best to fiddle with stuff to make the reboots less noisy. Looks like our reboot cookbook removes the silences before doing a follow-up puppet run to ensure the service is in a good state, so I'm working on patching that
[23:33:38] RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:38:27] (PS1) Cwhite: logstash: clean up curator actions todo items [puppet] - https://gerrit.wikimedia.org/r/869251 (https://phabricator.wikimedia.org/T301760)
[23:45:02] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[23:45:12] (PS1) Cwhite: logstash: change ecs-default clean up policy to prefix [puppet] - https://gerrit.wikimedia.org/r/869252
[23:45:14] (PS1) Cwhite: logstash: change ecs-test clean up policy to prefix [puppet] - https://gerrit.wikimedia.org/r/869253
[23:45:16] (PS1) Cwhite: logstash: change w3creportingapi clean up policy to prefix [puppet] - https://gerrit.wikimedia.org/r/869254
[23:45:33] (PS1) Ahmon Dancy: train-presync: Pass -Dfull_image_build:True to scap stage-train [puppet] - https://gerrit.wikimedia.org/r/869333 (https://phabricator.wikimedia.org/T325576)
[23:45:38] (CR) CI reject: [V: -1] logstash: change w3creportingapi clean up policy to prefix [puppet] - https://gerrit.wikimedia.org/r/869254 (owner: Cwhite)
[23:47:23] (CR) CI reject: [V: -1] train-presync: Pass -Dfull_image_build:True to scap stage-train [puppet] - https://gerrit.wikimedia.org/r/869333 (https://phabricator.wikimedia.org/T325576) (owner: Ahmon Dancy)
[23:48:42] (PS1) Ryan Kemper: wdqs: improve reliability of reboots [cookbooks] - https://gerrit.wikimedia.org/r/869334
[23:49:40] (PS2) Ahmon Dancy: train-presync: Pass -Dfull_image_build:True to scap stage-train [puppet] - https://gerrit.wikimedia.org/r/869333 (https://phabricator.wikimedia.org/T325576)
[23:50:55] !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[23:51:57] (CR) RLazarus: [C: +2] "Thanks!" [software/httpbb] - https://gerrit.wikimedia.org/r/865789 (owner: RLazarus)
[23:52:58] SRE, SRE-Access-Requests: Requesting access to Turnilo/ logstash for USER:eileen - https://phabricator.wikimedia.org/T325608 (Eileenmcnaughton)
[23:53:17] SRE, All-and-every-Wikisource, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (Albertoleoncio) Including SRE as it involves Google Search Console
[23:53:37] (Merged) jenkins-bot: Refactor: Migrate from attrs to dataclasses [software/httpbb] - https://gerrit.wikimedia.org/r/865789 (owner: RLazarus)
[23:54:00] (CR) RLazarus: [C: +2] Refactor: Wrap verify_certs inside an Options type. [software/httpbb] - https://gerrit.wikimedia.org/r/865790 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)
[23:55:26] (CR) Jeena Huneidi: [C: +1] train-presync: Pass -Dfull_image_build:True to scap stage-train [puppet] - https://gerrit.wikimedia.org/r/869333 (https://phabricator.wikimedia.org/T325576) (owner: Ahmon Dancy)
[23:55:31] (Merged) jenkins-bot: Refactor: Wrap verify_certs inside an Options type. [software/httpbb] - https://gerrit.wikimedia.org/r/865790 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)
[23:59:49] (CR) RLazarus: Add an option, off by default, to retry once when a request times out. (1 comment) [software/httpbb] - https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)