[00:05:31] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:05:55] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:07:25] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49123 bytes in 4.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:08:35] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.242 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:09:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:14:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:19:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:23:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:28:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:31:17] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:37:41] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:47] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:45:19] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:46:55] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:51:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:56:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:59:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:04:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:20:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:25:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:26:42] <wikibugs>	 SRE, DiscussionTools, Wikimedia-Incident: large number of 503 errors - https://phabricator.wikimedia.org/T325477 (tstarling) Since 2022-12-08 22:20 we have had high traffic to action=discussiontoolspageinfo, with a daily peak of around 2k req/s. So that is {T321961}. The outage was more a latency...
[01:28:00] <wikibugs>	 (PS1) Tim Starling: Revert "Start mobile DiscussionTools A/B test" [mediawiki-config] - https://gerrit.wikimedia.org/r/868544 (https://phabricator.wikimedia.org/T321961)
[01:30:18] <wikibugs>	 (PS2) Tim Starling: Revert "Start mobile DiscussionTools A/B test" [mediawiki-config] - https://gerrit.wikimedia.org/r/868544 (https://phabricator.wikimedia.org/T321961)
[01:31:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:36:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:38:45] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:43:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:45:47] <icinga-wm>	 PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 5.427e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/d/000000305/maps-performances?orgId=1&viewPanel=11
[01:46:14] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM!" [mediawiki-config] - https://gerrit.wikimedia.org/r/868544 (https://phabricator.wikimedia.org/T321961) (owner: Tim Starling)
[01:47:16] <wikibugs>	 (CR) Platonides: [C: +1] Revert "Start mobile DiscussionTools A/B test" [mediawiki-config] - https://gerrit.wikimedia.org/r/868544 (https://phabricator.wikimedia.org/T321961) (owner: Tim Starling)
[01:48:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:53:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:58:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:03:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:04:27] <icinga-wm>	 PROBLEM - Check systemd state on an-master1002 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-namenode-backup-hdfs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:05:14] <wikibugs>	 (CR) Ladsgroup: [C: +1] "I'm around if you want to deploy." [mediawiki-config] - https://gerrit.wikimedia.org/r/868544 (https://phabricator.wikimedia.org/T321961) (owner: Tim Starling)
[02:08:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:18:45] <jinxer-wm>	 (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[02:24:15] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:29:15] <jinxer-wm>	 (JobUnavailable) resolved: (6) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:29:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (UPDATE certificates) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:34:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (UPDATE certificates) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:39:10] <wikibugs>	 SRE, DiscussionTools, Patch-For-Review, Wikimedia-Incident: large number of 503 errors - https://phabricator.wikimedia.org/T325477 (tstarling) discussiontoolspageinfo request rate {F35874913} A bit noisy starting around 14:50, but other days show that kind of pattern near the daily peak. It get...
[02:40:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:45:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:51:02] <wikibugs>	 (CR) Tim Starling: [C: +2] Revert "Start mobile DiscussionTools A/B test" [mediawiki-config] - https://gerrit.wikimedia.org/r/868544 (https://phabricator.wikimedia.org/T321961) (owner: Tim Starling)
[02:51:43] <wikibugs>	 (Merged) jenkins-bot: Revert "Start mobile DiscussionTools A/B test" [mediawiki-config] - https://gerrit.wikimedia.org/r/868544 (https://phabricator.wikimedia.org/T321961) (owner: Tim Starling)
[03:02:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:07:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:51] <logmsgbot>	 !log tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: disable wgDiscussionToolsABTest T325477 T321961 (duration: 15m 23s)
[03:08:57] <stashbot>	 T325477: large number of 503 errors - https://phabricator.wikimedia.org/T325477
[03:08:57] <stashbot>	 T321961: [Config Change] Start mobile DiscussionTools A/B test - https://phabricator.wikimedia.org/T321961
[03:10:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:15:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:19:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:24:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:27:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:28:33] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 162 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:29:41] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[03:30:07] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:31:17] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[03:37:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:43:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:58:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:59:16] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:00:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:01:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:13:19] <wikibugs>	 SRE, DiscussionTools, Wikimedia-Incident: large number of 503 errors - https://phabricator.wikimedia.org/T325477 (DLynch) @tstarling can those charts get more granular? I'd be very interested to know whether it was the `transcludedfrom` or `threaditemshtml` prop being requested from `discussiontoolsp...
[04:14:16] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:16:07] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[04:16:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:17:41] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[04:22:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:31:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:36:02] <wikibugs>	 SRE, DiscussionTools, Wikimedia-Incident: large number of 503 errors - https://phabricator.wikimedia.org/T325477 (tstarling) Since the configuration variable is saved into the Varnish/ATS cache, you can still see it on some pages. For example viewing https://es.m.wikipedia.org/wiki/Sede_de_la_Organiz...
[04:41:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:42:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:43:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:44:28] <wikibugs>	 SRE, DiscussionTools, Wikimedia-Incident: large number of 503 errors - https://phabricator.wikimedia.org/T325477 (tstarling) The stack trace for the network request shows controller.js init() calling getPageData():  `lang=js  // TODO: Isn't this too early to load it? We will only need it if the user...
[04:49:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:54:02] <wikibugs>	 SRE, DiscussionTools, Wikimedia-Incident: large number of 503 errors - https://phabricator.wikimedia.org/T325477 (DLynch) Yeah, the issue here is (mostly) us including the general DiscussionTools JS for some test-related effects, and it having those early-loading side-effects. We should either pull o...
[04:58:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:59:18] <wikibugs>	 SRE, SRE Program Management, Logos: SRE needs a logo - https://phabricator.wikimedia.org/T312067 (Ladsgroup) Doodling some ideas: {F35875012} {F35875011} {F35875010} {F35875009} {F35875008} {F35875007}
[05:00:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:04:41] <wikibugs>	 SRE, DiscussionTools, Wikimedia-Incident: large number of 503 errors - https://phabricator.wikimedia.org/T325477 (tstarling) Does the response have any private data in it? I think if ApiDiscussionToolsPageInfo::execute() called $this->getMain()->setCacheMode( 'public' ) and you set the query string p...
[05:09:01] <wikibugs>	 (CR) Ladsgroup: [C: +1] "If we are sure all contributors have agreed. I think there is one that's banned?" [puppet] - https://gerrit.wikimedia.org/r/868707 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[05:09:18] <wikibugs>	 SRE, DiscussionTools, Wikimedia-Incident: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 (tstarling)
[05:09:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:10:34] <wikibugs>	 SRE, DiscussionTools, Wikimedia-Incident: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 (DLynch) It shouldn't. That particular call is just asking whether anything on the page is inside a transclusion, to work out whether it can actually be u...
[05:22:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:40:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:40:47] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:45:04] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:46:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:47:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:51:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:56:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:01:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:06:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:12:47] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:16:47] <icinga-wm>	 PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[06:21:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:25:16] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] Refactor: Migrate from attrs to dataclasses [software/httpbb] - https://gerrit.wikimedia.org/r/865789 (owner: RLazarus)
[06:26:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:31:05] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:31:49] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:38:41] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:39:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:44:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:45:31] <wikibugs>	 SRE, DiscussionTools, Wikimedia-Incident: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 (Ladsgroup) I might be missing something obvious but https://es.m.wikipedia.org/wiki/Sede_de_la_Organizaci%C3%B3n_de_las_Naciones_Unidas is an article. Wh...
[06:46:27] <wikibugs>	 SRE, DiscussionTools, Wikimedia-Incident: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 (DLynch) See: > the issue here is (mostly) us including the general DiscussionTools JS for some test-related effects, and it having those early-loading si...
[06:48:01] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 11686
[06:48:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:51:29] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:53:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:55:05] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:55:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:55:49] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:58:11] <wikibugs>	 (PS1) Marostegui: db1207-db1229: Set up new hosts [puppet] - https://gerrit.wikimedia.org/r/869049 (https://phabricator.wikimedia.org/T325209)
[06:59:07] <wikibugs>	 (CR) CI reject: [V: -1] db1207-db1229: Set up new hosts [puppet] - https://gerrit.wikimedia.org/r/869049 (https://phabricator.wikimedia.org/T325209) (owner: Marostegui)
[07:00:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:02:05] <wikibugs>	 (PS2) Marostegui: db1207-db1229: Set up new hosts [puppet] - https://gerrit.wikimedia.org/r/869049 (https://phabricator.wikimedia.org/T325209)
[07:02:41] <wikibugs>	 (CR) Marostegui: [C: +2] db1207-db1229: Set up new hosts [puppet] - https://gerrit.wikimedia.org/r/869049 (https://phabricator.wikimedia.org/T325209) (owner: Marostegui)
[07:03:55] <wikibugs>	 (PS1) Marostegui: db2185-db2187: Add header [puppet] - https://gerrit.wikimedia.org/r/869050
[07:04:01] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 11686
[07:04:17] <wikibugs>	 (CR) Marostegui: [C: +2] db2185-db2187: Add header [puppet] - https://gerrit.wikimedia.org/r/869050 (owner: Marostegui)
[07:04:41] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 136 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:05:15] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 28398
[07:05:29] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'configure' for AS: 28398
[07:06:17] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:07:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:12:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:15:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:18:46] <wikibugs>	 SRE, DC-Ops, Traffic-Icebox: Fix recdns config on various hardware devices - https://phabricator.wikimedia.org/T254178 (ayounsi)
[07:22:33] <logmsgbot>	 !log phedenskog@deploy1002 Started deploy [performance/navtiming@5770d46]: (no justification provided)
[07:22:42] <logmsgbot>	 !log phedenskog@deploy1002 Finished deploy [performance/navtiming@5770d46]: (no justification provided) (duration: 00m 08s)
[07:25:39] <wikibugs>	 SRE, DC-Ops, Traffic-Icebox: Fix recdns config on various hardware devices - https://phabricator.wikimedia.org/T254178 (ayounsi) @BCornwall I took care of both of them :)  185.15.56.1 is the generic NAT IP for the WMCS realm.  @aborrero, for context, 208.80.153.254 is our old recursive DNS IP and it'...
[07:25:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:53:15] <wikibugs>	 (CR) Muehlenhoff: lists: Add SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868707 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[07:54:09] <wikibugs>	 (CR) Muehlenhoff: [C: +2] orchestrator: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/868708 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[07:54:14] <wikibugs>	 (PS2) Muehlenhoff: orchestrator: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/868708 (https://phabricator.wikimedia.org/T308013)
[07:54:37] <wikibugs>	 (CR) Ladsgroup: [C: +1] lists: Add SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868707 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221219T0800)
[08:00:21] <wikibugs>	 (CR) Muehlenhoff: lists: Add SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868707 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:02:09] <moritzm>	 !log installing openexr security updates
[08:02:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:40] <wikibugs>	 (PS1) Ladsgroup: Emergency: discussiontoolspageinfo return empty response in non-talk ns [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868867 (https://phabricator.wikimedia.org/T325477)
[08:08:06] <wikibugs>	 (CR) Ladsgroup: [C: +2] Emergency: discussiontoolspageinfo return empty response in non-talk ns [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868867 (https://phabricator.wikimedia.org/T325477) (owner: Ladsgroup)
[08:11:26] <wikibugs>	 (PS1) Marostegui: site.pp: Remove db1206 testing [puppet] - https://gerrit.wikimedia.org/r/869164
[08:11:39] <wikibugs>	 (CR) Ladsgroup: Emergency: discussiontoolspageinfo return empty response in non-talk ns [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868867 (https://phabricator.wikimedia.org/T325477) (owner: Ladsgroup)
[08:11:55] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Remove db1206 testing [puppet] - https://gerrit.wikimedia.org/r/869164 (owner: Marostegui)
[08:19:55] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:24:23] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 119 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:25:59] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 3 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:26:13] <wikibugs>	 (CR) Volans: ensure_canary: 0-pad the instance counter (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/868803 (owner: Andrew Bogott)
[08:27:29] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/httpbb] - https://gerrit.wikimedia.org/r/865789 (owner: RLazarus)
[08:29:07] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/httpbb] - https://gerrit.wikimedia.org/r/865790 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)
[08:29:43] <wikibugs>	 (CR) David Caro: "This is really unfortunate, we have a bunch of VMs whose numbering is not padded (and aim to not be so, as we might pass the 99 barrier)" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/868803 (owner: Andrew Bogott)
[08:32:51] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 29535
[08:33:03] <wikibugs>	 (CR) David Caro: ensure_canary: 0-pad the instance counter (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/868803 (owner: Andrew Bogott)
[08:33:20] <wikibugs>	 (CR) Volans: "question inline" [software/httpbb] - https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)
[08:33:22] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 29535
[08:33:56] <wikibugs>	 (PS1) Aqu: Test to debug  missing scripts on standby namenode [puppet] - https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850)
[08:34:16] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[08:35:59] <wikibugs>	 (CR) Aqu: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38860/console" [puppet] - https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) (owner: Aqu)
[08:37:58] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[08:39:35] <wikibugs>	 (CR) Ayounsi: First stab at possible ferm::qos resource for DSCP marking (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868156 (https://phabricator.wikimedia.org/T316358) (owner: Cathal Mooney)
[08:40:59] <wikibugs>	 (PS2) Aqu: Test to debug  missing scripts on standby namenode [puppet] - https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850)
[08:40:59] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:41:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[08:42:25] <wikibugs>	 (CR) Aqu: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38861/console" [puppet] - https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) (owner: Aqu)
[08:45:41] <wikibugs>	 (PS3) Aqu: Test to debug  missing scripts on standby namenode [puppet] - https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850)
[08:45:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[08:46:00] <wikibugs>	 (CR) CI reject: [V: -1] Test to debug  missing scripts on standby namenode [puppet] - https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) (owner: Aqu)
[08:46:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[08:46:58] <wikibugs>	 (CR) Aqu: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38862/console" [puppet] - https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) (owner: Aqu)
[08:50:49] <wikibugs>	 (CR) Ladsgroup: [C: +2] Emergency: discussiontoolspageinfo return empty response in non-talk ns [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868867 (https://phabricator.wikimedia.org/T325477) (owner: Ladsgroup)
[08:53:21] <wikibugs>	 (PS1) David Caro: Add moved to wmcs-cookbooks message. [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/869167
[08:53:45] <wikibugs>	 (CR) CI reject: [V: -1] Add moved to wmcs-cookbooks message. [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/869167 (owner: David Caro)
[08:54:12] <wikibugs>	 (PS4) Aqu: Test to debug  missing scripts on standby namenode [puppet] - https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850)
[08:54:14] <wikibugs>	 (CR) David Caro: [C: -1] ensure_canary: use the smaller cloudvirt-canary-ceph flavor (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/868799 (owner: Andrew Bogott)
[08:55:22] <wikibugs>	 (CR) Aqu: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38863/console" [puppet] - https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) (owner: Aqu)
[08:56:01] <wikibugs>	 (PS2) David Caro: Add moved to wmcs-cookbooks message. [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/869167
[08:56:21] <wikibugs>	 (CR) CI reject: [V: -1] Add moved to wmcs-cookbooks message. [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/869167 (owner: David Caro)
[08:56:43] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[08:57:08] <wikibugs>	 (Merged) jenkins-bot: Emergency: discussiontoolspageinfo return empty response in non-talk ns [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868867 (https://phabricator.wikimedia.org/T325477) (owner: Ladsgroup)
[08:58:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[08:59:40] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[09:01:27] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:02:54] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[09:02:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:03:45] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[09:04:06] <dcausse>	 !log restarting blazegraph on wdqs1015 (BlazegraphFreeAllocatorsDecreasingRapidly)
[09:04:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:20] <wikibugs>	 (PS1) Muehlenhoff: Add end date and contact for aitolkyn's access [puppet] - https://gerrit.wikimedia.org/r/869170
[09:05:31] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868867 (https://phabricator.wikimedia.org/T325477) (owner: Ladsgroup)
[09:05:47] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:868867|Emergency: discussiontoolspageinfo return empty response in non-talk ns (T325477)]]
[09:05:51] <stashbot>	 T325477: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477
[09:06:02] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[09:06:29] <wikibugs>	 (CR) David Caro: "CI does not like that tox does not generate any logs :/, but I think we can merge this as there's no more code to build on it expected any" [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/869167 (owner: David Caro)
[09:07:35] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:868867|Emergency: discussiontoolspageinfo return empty response in non-talk ns (T325477)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[09:07:37] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add end date and contact for aitolkyn's access [puppet] - https://gerrit.wikimedia.org/r/869170 (owner: Muehlenhoff)
[09:07:58] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:08:11] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38864/console" [puppet] - https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) (owner: Aqu)
[09:11:08] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:12:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[09:12:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:14:09] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[09:15:12] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:868867|Emergency: discussiontoolspageinfo return empty response in non-talk ns (T325477)]] (duration: 09m 24s)
[09:15:15] <stashbot>	 T325477: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477
[09:16:14] <wikibugs>	 (PS2) Muehlenhoff: sre.misc-clusters.roll-restart-reboot-eventschemas: Also restart envoyproxy [cookbooks] - https://gerrit.wikimedia.org/r/860556
[09:17:42] <aqu>	 !log About to deploy analytics/refinery (bug fix in HDFS usage pipeline)
[09:17:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:14] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:18:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:18:56] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove d-i-test from special handling [software/netbox-extras] - https://gerrit.wikimedia.org/r/841919 (owner: Muehlenhoff)
[09:19:28] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:19:36] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@2d53aff] (hadoop-test): Fix bug fix in HDFS usage pipeline TEST [analytics/refinery@2d53aff]
[09:20:51] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@2d53aff] (hadoop-test): Fix bug fix in HDFS usage pipeline TEST [analytics/refinery@2d53aff] (duration: 01m 14s)
[09:21:24] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@2d53aff]: Fix bug fix in HDFS usage pipeline [analytics/refinery@2d53aff]
[09:24:03] <wikibugs>	 SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, serviceops-collab, and 2 others: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (jnuche) > I think I'd vote contint-root, but a question I have is: is there a way to add the contint-roo...
[09:24:46] <wikibugs>	 SRE, Gerrit, serviceops-collab: move gerrit.wm.org SSH service to private/behind LVS like phab-vcs - https://phabricator.wikimedia.org/T165631 (ayounsi) @Dzahn, why is this not relevant anymore?
[09:27:28] <wikibugs>	 (PS6) Slyngshede: Signup and LDAP flow. [software/bitu] - https://gerrit.wikimedia.org/r/860021
[09:28:36] <wikibugs>	 (CR) Ayounsi: "Why not removing "NO_PUPPETDB_VMS" as well? It would help keep the code lean." [software/netbox-extras] - https://gerrit.wikimedia.org/r/841919 (owner: Muehlenhoff)
[09:29:27] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@2d53aff]: Fix bug fix in HDFS usage pipeline [analytics/refinery@2d53aff] (duration: 08m 02s)
[09:29:53] <logmsgbot>	 !log aqu@deploy1002 Started deploy [analytics/refinery@2d53aff] (thin): Fix bug fix in HDFS usage pipeline THIN [analytics/refinery@2d53aff]
[09:30:01] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [analytics/refinery@2d53aff] (thin): Fix bug fix in HDFS usage pipeline THIN [analytics/refinery@2d53aff] (duration: 00m 08s)
[09:33:52] <wikibugs>	 (PS1) Volans: cumin::cloud_master: add openstack dependencies [puppet] - https://gerrit.wikimedia.org/r/869173 (https://phabricator.wikimedia.org/T319401)
[09:39:53] <wikibugs>	 (PS1) Ayounsi: Remove code specific to d-i-test [software/netbox-extras] - https://gerrit.wikimedia.org/r/869175
[09:40:46] <wikibugs>	 (CR) CI reject: [V: -1] Remove code specific to d-i-test [software/netbox-extras] - https://gerrit.wikimedia.org/r/869175 (owner: Ayounsi)
[09:41:11] <wikibugs>	 (PS2) Ayounsi: Remove code specific to d-i-test [software/netbox-extras] - https://gerrit.wikimedia.org/r/869175
[09:41:17] <wikibugs>	 (PS1) Muehlenhoff: Remove access for three researchers [puppet] - https://gerrit.wikimedia.org/r/869176
[09:41:53] <wikibugs>	 (CR) CI reject: [V: -1] Remove access for three researchers [puppet] - https://gerrit.wikimedia.org/r/869176 (owner: Muehlenhoff)
[09:42:07] <wikibugs>	 (CR) CI reject: [V: -1] Remove code specific to d-i-test [software/netbox-extras] - https://gerrit.wikimedia.org/r/869175 (owner: Ayounsi)
[09:43:13] <wikibugs>	 (PS2) Muehlenhoff: Remove access for three researchers [puppet] - https://gerrit.wikimedia.org/r/869176
[09:43:50] <wikibugs>	 (CR) CI reject: [V: -1] Remove access for three researchers [puppet] - https://gerrit.wikimedia.org/r/869176 (owner: Muehlenhoff)
[09:43:53] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38865/console" [puppet] - https://gerrit.wikimedia.org/r/868726 (https://phabricator.wikimedia.org/T324992) (owner: Arturo Borrero Gonzalez)
[09:44:55] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/812239 (owner: Muehlenhoff)
[09:44:59] <wikibugs>	 (CR) Ayounsi: "The CI error doesn't seem to be related to this CR." [software/netbox-extras] - https://gerrit.wikimedia.org/r/869175 (owner: Ayounsi)
[09:45:27] <wikibugs>	 (CR) Majavah: "somehow the PCC link does not work :(" [puppet] - https://gerrit.wikimedia.org/r/868726 (https://phabricator.wikimedia.org/T324992) (owner: Arturo Borrero Gonzalez)
[09:45:52] <wikibugs>	 (PS3) Muehlenhoff: Remove access for three researchers [puppet] - https://gerrit.wikimedia.org/r/869176
[09:47:22] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@6ac3269]: Fix bug fix in HDFS usage pipeline TEST [airflow-dags@6ac3269]
[09:47:23] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] cloudlb: haproxy: support 3 firewalling options [puppet] - https://gerrit.wikimedia.org/r/868726 (https://phabricator.wikimedia.org/T324992) (owner: Arturo Borrero Gonzalez)
[09:47:34] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@6ac3269]: Fix bug fix in HDFS usage pipeline TEST [airflow-dags@6ac3269] (duration: 00m 11s)
[09:48:51] <logmsgbot>	 !log aqu@deploy1002 Started deploy [airflow-dags/analytics@6ac3269]: Fix bug fix in HDFS usage pipeline [airflow-dags@6ac3269]
[09:49:04] <logmsgbot>	 !log aqu@deploy1002 Finished deploy [airflow-dags/analytics@6ac3269]: Fix bug fix in HDFS usage pipeline [airflow-dags@6ac3269] (duration: 00m 13s)
[09:49:25] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove access for three researchers [puppet] - https://gerrit.wikimedia.org/r/869176 (owner: Muehlenhoff)
[09:52:01] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (MoritzMuehlenhoff)
[09:54:29] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: cloudlb: haproxy: support 3 firewalling options [puppet] - https://gerrit.wikimedia.org/r/868726 (https://phabricator.wikimedia.org/T324992)
[09:55:32] <wikibugs>	 (PS4) Arturo Borrero Gonzalez: cloudlb: haproxy: support 3 firewalling options [puppet] - https://gerrit.wikimedia.org/r/868726 (https://phabricator.wikimedia.org/T324992)
[09:55:53] <wikibugs>	 (CR) Arturo Borrero Gonzalez: cloudlb: haproxy: support 3 firewalling options (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868726 (https://phabricator.wikimedia.org/T324992) (owner: Arturo Borrero Gonzalez)
[09:56:38] <wikibugs>	 (PS5) Arturo Borrero Gonzalez: cloudlb: haproxy: support 3 firewalling options [puppet] - https://gerrit.wikimedia.org/r/868726 (https://phabricator.wikimedia.org/T324992)
[09:59:06] <moritzm>	 !log update bullseye netboot image for Bullseye 11.6 point release T325186
[09:59:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:10] <stashbot>	 T325186: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186
[10:00:03] <wikibugs>	 (CR) Jaime Nuche: [C: -1] "I think we want to avoid granting full privileges on the Jenkins target hosts to Jenkins deployers. Also the key applies to non-contint se" [puppet] - https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn)
[10:01:23] <wikibugs>	 (CR) Awight: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/869178 (https://phabricator.wikimedia.org/T323113) (owner: Awight)
[10:02:42] <wikibugs>	 SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, serviceops-collab, and 2 others: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (jnuche) @Dzahn sorry, I just saw your patch at https://gerrit.wikimedia.org/r/c/operations/puppet/+/8687...
[10:06:42] <wikibugs>	 (CR) Awight: [C: +2] "Deploying to the beta cluster" [mediawiki-config] - https://gerrit.wikimedia.org/r/869178 (https://phabricator.wikimedia.org/T323113) (owner: Awight)
[10:07:31] <wikibugs>	 (Merged) jenkins-bot: [beta] Expand mapframe ExternalData [mediawiki-config] - https://gerrit.wikimedia.org/r/869178 (https://phabricator.wikimedia.org/T323113) (owner: Awight)
[10:12:36] <wikibugs>	 (PS1) Marostegui: wikitech.my.cnf.erb: Remove unix_socket mention [puppet] - https://gerrit.wikimedia.org/r/869179 (https://phabricator.wikimedia.org/T325154)
[10:12:58] <wikibugs>	 (CR) Marostegui: [C: +2] wikitech.my.cnf.erb: Remove unix_socket mention [puppet] - https://gerrit.wikimedia.org/r/869179 (https://phabricator.wikimedia.org/T325154) (owner: Marostegui)
[10:13:48] <wikibugs>	 (PS1) Majavah: Only preload getPageData if there's thread data for the page [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868869 (https://phabricator.wikimedia.org/T325477)
[10:14:43] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868869 (https://phabricator.wikimedia.org/T325477) (owner: Majavah)
[10:15:52] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (MoritzMuehlenhoff)
[10:16:25] <wikibugs>	 SRE, Platform Engineering, serviceops, Patch-For-Review, Performance-Team (Radar): Phase out "redis_sessions" cluster and away from memcached cluster - https://phabricator.wikimedia.org/T267581 (jijiki) Open→Resolved
[10:19:53] <wikibugs>	 (Merged) jenkins-bot: Only preload getPageData if there's thread data for the page [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868869 (https://phabricator.wikimedia.org/T325477) (owner: Majavah)
[10:20:09] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:868869|Only preload getPageData if there's thread data for the page (T325477)]]
[10:20:14] <stashbot>	 T325477: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477
[10:20:30] <wikibugs>	 (PS1) Volans: dns: update type hints comments [software/netbox-extras] - https://gerrit.wikimedia.org/r/869180
[10:21:24] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/869175 (owner: Ayounsi)
[10:21:50] <logmsgbot>	 !log taavi@deploy1002 taavi and taavi: Backport for [[gerrit:868869|Only preload getPageData if there's thread data for the page (T325477)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet
[10:23:43] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.k8s.pool-depool-cluster
[10:23:43] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.discovery.service-route
[10:23:45] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
[10:23:45] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0)
[10:24:18] <elukey>	 I ran "check" for ml-serve-codfw --^
[10:26:39] <wikibugs>	 (CR) Ayounsi: [C: +1] dns: update type hints comments [software/netbox-extras] - https://gerrit.wikimedia.org/r/869180 (owner: Volans)
[10:26:53] <wikibugs>	 (CR) Volans: [C: +2] dns: update type hints comments [software/netbox-extras] - https://gerrit.wikimedia.org/r/869180 (owner: Volans)
[10:27:42] <wikibugs>	 (Merged) jenkins-bot: dns: update type hints comments [software/netbox-extras] - https://gerrit.wikimedia.org/r/869180 (owner: Volans)
[10:27:54] <wikibugs>	 (PS1) Elukey: sre.k8s.maintenance: add missing admin reason [cookbooks] - https://gerrit.wikimedia.org/r/869182 (https://phabricator.wikimedia.org/T277677)
[10:28:00] <wikibugs>	 (PS3) Volans: Remove code specific to d-i-test [software/netbox-extras] - https://gerrit.wikimedia.org/r/869175 (owner: Ayounsi)
[10:28:07] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:868869|Only preload getPageData if there's thread data for the page (T325477)]] (duration: 07m 58s)
[10:28:11] <stashbot>	 T325477: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477
[10:28:28] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/869182 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[10:28:49] <wikibugs>	 (PS2) Elukey: sre.k8s.maintenance: add missing admin reason [cookbooks] - https://gerrit.wikimedia.org/r/869182 (https://phabricator.wikimedia.org/T277677)
[10:29:09] <elukey>	 volans: o/ thanksss I tried to compress the msg, lemme know if the old was better
[10:29:38] <wikibugs>	 (CR) Ayounsi: [C: +2] Remove code specific to d-i-test [software/netbox-extras] - https://gerrit.wikimedia.org/r/869175 (owner: Ayounsi)
[10:30:01] <volans>	 elukey: ack, I'll comment on the CR
[10:30:27] <wikibugs>	 (Merged) jenkins-bot: Remove code specific to d-i-test [software/netbox-extras] - https://gerrit.wikimedia.org/r/869175 (owner: Ayounsi)
[10:31:11] <wikibugs>	 (CR) FNegri: cumin::cloud_master: add openstack dependencies (1 comment) [puppet] - https://gerrit.wikimedia.org/r/869173 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[10:31:37] <wikibugs>	 (PS5) Aqu: Test to debug missing scripts on standby namenode [puppet] - https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850)
[10:31:44] <wikibugs>	 (PS1) Effie Mouzeli: Puppet: Remove nutcracker and multi-dc redis [puppet] - https://gerrit.wikimedia.org/r/869183
[10:33:13] <wikibugs>	 (CR) Aqu: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38866/console" [puppet] - https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) (owner: Aqu)
[10:33:43] <logmsgbot>	 !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@b4d31fb]: incoming_link: relax sensor timeout to default 7d
[10:34:25] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): ceph: decide and/or test 1 network interface setup performance - https://phabricator.wikimedia.org/T325531 (aborrero)
[10:34:36] <wikibugs>	 (CR) Volans: "questions inline" [cookbooks] - https://gerrit.wikimedia.org/r/869182 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[10:35:07] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:35:19] <wikibugs>	 (CR) Volans: "reply inline" [puppet] - https://gerrit.wikimedia.org/r/869173 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[10:35:32] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): ceph: decide and/or test 1 network interface setup performance - https://phabricator.wikimedia.org/T325531 (aborrero) p:Triage→High
[10:36:11] <logmsgbot>	 !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@b4d31fb]: incoming_link: relax sensor timeout to default 7d (duration: 02m 28s)
[10:39:51] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:43:32] <wikibugs>	 (PS6) Aqu: Test to debug missing scripts on standby namenode [puppet] - https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850)
[10:44:30] <wikibugs>	 SRE, DC-Ops, Traffic-Icebox: Fix recdns config on various hardware devices - https://phabricator.wikimedia.org/T254178 (aborrero) In a quick search using cumin I didn't find anything relevant:  `lang=shell-session aborrero@cloud-cumin-03:~$ sudo cumin --force -x '*' "grep 208.80.154.254 /etc/resolv.c...
[10:44:39] <wikibugs>	 (CR) Aqu: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38867/console" [puppet] - https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) (owner: Aqu)
[10:45:41] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): ceph: decide and/or test 1 network interface setup performance - https://phabricator.wikimedia.org/T325531 (dcaro) I thought that we had decided already to test, and depending on that then decided if go/nogo for the implementa...
[10:47:09] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] mediawiki-common: Replace redis_session servers with rdb* [deployment-charts] - https://gerrit.wikimedia.org/r/867707 (https://phabricator.wikimedia.org/T267581) (owner: Effie Mouzeli)
[10:47:36] <wikibugs>	 SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): ceph: decide and/or test 1 network interface setup performance - https://phabricator.wikimedia.org/T325531 (aborrero) >>! In T325531#8477745, @dcaro wrote: > I thought that we had decided already to test, and depending on that...
[10:48:30] <wikibugs>	 SRE, DC-Ops, Traffic-Icebox: Fix recdns config on various hardware devices - https://phabricator.wikimedia.org/T254178 (ayounsi) NAT logs or tcpdump on the device doing NAT should help pinpoint the host(s).
[10:51:40] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons.
[10:52:12] <wikibugs>	 (Merged) jenkins-bot: mediawiki-common: Replace redis_session servers with rdb* [deployment-charts] - https://gerrit.wikimedia.org/r/867707 (https://phabricator.wikimedia.org/T267581) (owner: Effie Mouzeli)
[10:54:58] <wikibugs>	 (PS7) Aqu: Fix missing script in HDFS usage dataset pipeline [puppet] - https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850)
[11:03:26] <wikibugs>	 (CR) Btullis: [C: +2] Fix missing script in HDFS usage dataset pipeline [puppet] - https://gerrit.wikimedia.org/r/869166 (https://phabricator.wikimedia.org/T324850) (owner: Aqu)
[11:09:14] <wikibugs>	 (PS1) David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - https://gerrit.wikimedia.org/r/869187
[11:11:21] <wikibugs>	 (Abandoned) Jbond: base::cloud::production: allow cloud prod to override ssh [puppet] - https://gerrit.wikimedia.org/r/868716 (owner: Jbond)
[11:13:17] <wikibugs>	 (PS3) Elukey: sre.k8s.maintenance: fix missing admin reason [cookbooks] - https://gerrit.wikimedia.org/r/869182 (https://phabricator.wikimedia.org/T277677)
[11:13:52] <wikibugs>	 (CR) Elukey: sre.k8s.maintenance: fix missing admin reason (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/869182 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[11:13:53] <icinga-wm>	 RECOVERY - Check systemd state on an-master1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:25:54] <wikibugs>	 (PS4) Elukey: sre.k8s.maintenance: fix missing admin reason [cookbooks] - https://gerrit.wikimedia.org/r/869182 (https://phabricator.wikimedia.org/T277677)
[11:27:03] <wikibugs>	 (CR) Volans: [C: +1] "LGTM thanks" [cookbooks] - https://gerrit.wikimedia.org/r/869182 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[11:27:46] <wikibugs>	 (CR) Elukey: [C: +2] sre.k8s.maintenance: fix missing admin reason [cookbooks] - https://gerrit.wikimedia.org/r/869182 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[11:27:49] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons.
[11:28:00] <wikibugs>	 (CR) Esanders: "Thanks, this looks correct, although note that any namespace with signatures is considered a talk namespace by us, notably the main namesp" [extensions/DiscussionTools] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/868867 (https://phabricator.wikimedia.org/T325477) (owner: Ladsgroup)
[11:29:16] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.k8s.pool-depool-cluster check 1 in ml-serve-codfw: maintenance
[11:29:16] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.discovery.service-route
[11:29:18] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
[11:29:18] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.k8s.pool-depool-cluster (exit_code=0) check 1 in ml-serve-codfw: maintenance
[11:29:28] <elukey>	 ok better now
[11:37:59] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:39:20] <wikibugs>	 (CR) Giuseppe Lavagetto: [V: +2 C: +2] php-multiversion-base: add sendmail [docker-images/production-images] - https://gerrit.wikimedia.org/r/868428 (https://phabricator.wikimedia.org/T325131) (owner: Giuseppe Lavagetto)
[11:43:04] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on 10 hosts with reason: Reverting presto cluster size from 15 to 5 as a test
[11:43:23] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on 10 hosts with reason: Reverting presto cluster size from 15 to 5 as a test
[11:48:23] <jinxer-wm>	 (ThanosQueryRangeLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh
[11:49:11] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:53:23] <jinxer-wm>	 (ThanosQueryRangeLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh
[11:53:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:54:42] <wikibugs>	 (CR) Jbond: [C: +1] "yay lgtm" [puppet] - https://gerrit.wikimedia.org/r/869183 (owner: Effie Mouzeli)
[11:59:45] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm minor comments inline" [puppet] - https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) (owner: David Caro)
[12:00:21] <wikibugs>	 SRE, DC-Ops, Traffic-Icebox: Fix recdns config on various hardware devices - https://phabricator.wikimedia.org/T254178 (aborrero) It is `diffscan02.automation-framework.eqiad1.wikimedia.cloud`.  There are 1k connections like this:  ` tcp      6 59 SYN_SENT src=172.16.3.44 dst=208.80.154.254 sport=596...
[12:01:51] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good (you can also remove role::ipsec and the strongswan classes, but can also be done in separate patch)." [puppet] - https://gerrit.wikimedia.org/r/869183 (owner: Effie Mouzeli)
[12:03:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:05:32] <wikibugs>	 SRE, DC-Ops, Traffic-Icebox: Fix recdns config on various hardware devices - https://phabricator.wikimedia.org/T254178 (ayounsi) a:BCornwall Nice! that makes sense as it scans all our IPs.  @BCornwall I think everything is completed here!
[12:06:49] <wikibugs>	 (CR) Muehlenhoff: [C: +2] nginx: let puppet pick the correct provider [puppet] - https://gerrit.wikimedia.org/r/868431 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[12:09:56] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on an-presto[1001-1005].eqiad.wmnet with reason: Trying five of the new presto servers instead of the original five
[12:10:23] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on an-presto[1001-1005].eqiad.wmnet with reason: Trying five of the new presto servers instead of the original five
[12:13:49] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.remove-downtime for an-presto[1006-1010].eqiad.wnet
[12:13:50] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for an-presto[1006-1010].eqiad.wnet
[12:13:56] <wikibugs>	 (CR) Jbond: [C: -1] Use a single file for public key (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: FNegri)
[12:15:18] <wikibugs>	 (CR) FNegri: [C: +1] cumin::cloud_master: add openstack dependencies (1 comment) [puppet] - https://gerrit.wikimedia.org/r/869173 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[12:15:30] <wikibugs>	 (CR) FNegri: [C: +1] cumin::cloud_master: add openstack dependencies (1 comment) [puppet] - https://gerrit.wikimedia.org/r/869173 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[12:16:46] <wikibugs>	 (PS1) Muehlenhoff: httpd: Let Puppet pick the init provider [puppet] - https://gerrit.wikimedia.org/r/869199 (https://phabricator.wikimedia.org/T321783)
[12:17:32] <wikibugs>	 (CR) Volans: [C: +2] cumin::cloud_master: add openstack dependencies [puppet] - https://gerrit.wikimedia.org/r/869173 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[12:18:38] <wikibugs>	 (CR) Muehlenhoff: [C: +2] analytics::cluster: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/868704 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[12:18:44] <wikibugs>	 (PS2) Muehlenhoff: analytics::cluster: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/868704 (https://phabricator.wikimedia.org/T308013)
[12:23:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:25:05] <wikibugs>	 (CR) Muehlenhoff: [C: +2] nutcracker: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/811227 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[12:32:22] <wikibugs>	 (CR) Jbond: "I think both these modules seem to be of a high quality and useful to our puppet code so no objection from me.  however i would like a se" [puppet] - https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396) (owner: JHathaway)
[12:33:56] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/869199 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff)
[12:38:40] <wikibugs>	 (PS2) Muehlenhoff: vrts / doc / etherpad / planet: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/868711 (https://phabricator.wikimedia.org/T308013)
[12:39:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:42:59] <wikibugs>	 (CR) Jaime Nuche: mwdebug_deploy: remove configuration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/867221 (owner: Jaime Nuche)
[12:44:51] <wikibugs>	 (CR) Muehlenhoff: [C: +2] vrts / doc / etherpad / planet: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/868711 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[12:44:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:45:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST nodes) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:46:10] <wikibugs>	 (PS2) Muehlenhoff: acmechief/ncredir: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/868710 (https://phabricator.wikimedia.org/T308013)
[12:48:41] <wikibugs>	 (CR) Muehlenhoff: [C: +2] acmechief/ncredir: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/868710 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[12:50:01] <wikibugs>	 (PS2) Muehlenhoff: vrts: Enable vrts profile::auto_restarts::service for Envoy [puppet] - https://gerrit.wikimedia.org/r/868663 (https://phabricator.wikimedia.org/T135991)
[12:50:14] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:56:27] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:57:41] <wikibugs>	 (PS3) Jbond: wmflib: function to get the ips for all hosts in a specific resource [puppet] - https://gerrit.wikimedia.org/r/868653
[12:57:57] <wikibugs>	 (PS3) Jbond: P:sretest: test the new ips functions [puppet] - https://gerrit.wikimedia.org/r/868654
[12:58:57] <icinga-wm>	 PROBLEM - Check systemd state on puppetdb1003 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:59:23] <icinga-wm>	 PROBLEM - puppet last run on puppetdb1003 is CRITICAL: CRITICAL: Puppet last ran 19 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[13:00:35] <wikibugs>	 (PS1) Majavah: P:grafana: move some profile declarations to roles [puppet] - https://gerrit.wikimedia.org/r/869208 (https://phabricator.wikimedia.org/T307465)
[13:00:37] <wikibugs>	 (PS1) Majavah: P:grafana: make the logo file customizable [puppet] - https://gerrit.wikimedia.org/r/869209 (https://phabricator.wikimedia.org/T307465)
[13:00:39] <wikibugs>	 (PS1) Majavah: P:metricsinfra: add profile and role for a Grafana server [puppet] - https://gerrit.wikimedia.org/r/869210 (https://phabricator.wikimedia.org/T307465)
[13:00:41] <wikibugs>	 (PS1) Majavah: P:wmcs::metricsinfra: add haproxy config for grafana [puppet] - https://gerrit.wikimedia.org/r/869211 (https://phabricator.wikimedia.org/T307465)
[13:01:03] <wikibugs>	 (CR) CI reject: [V: -1] P:metricsinfra: add profile and role for a Grafana server [puppet] - https://gerrit.wikimedia.org/r/869210 (https://phabricator.wikimedia.org/T307465) (owner: Majavah)
[13:01:57] <icinga-wm>	 RECOVERY - Check systemd state on puppetdb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:02:08] <wikibugs>	 (PS2) David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - https://gerrit.wikimedia.org/r/869187
[13:02:58] <wikibugs>	 (PS2) Majavah: P:metricsinfra: add profile and role for a Grafana server [puppet] - https://gerrit.wikimedia.org/r/869210 (https://phabricator.wikimedia.org/T307465)
[13:03:00] <wikibugs>	 (PS2) Majavah: P:wmcs::metricsinfra: add haproxy config for grafana [puppet] - https://gerrit.wikimedia.org/r/869211 (https://phabricator.wikimedia.org/T307465)
[13:03:47] <wikibugs>	 (PS4) Jbond: wmflib: function to get the ips for all hosts in a specific resource [puppet] - https://gerrit.wikimedia.org/r/868653
[13:03:54] <wikibugs>	 (PS4) Jbond: P:sretest: test the new ips functions [puppet] - https://gerrit.wikimedia.org/r/868654
[13:05:57] <wikibugs>	 (PS1) Volans: cumin::cloud_master: configure openstack backend [puppet] - https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401)
[13:07:27] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38873/console" [puppet] - https://gerrit.wikimedia.org/r/869209 (https://phabricator.wikimedia.org/T307465) (owner: Majavah)
[13:07:43] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:12:30] <wikibugs>	 (PS2) Volans: cumin::cloud_master: configure openstack backend [puppet] - https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401)
[13:12:32] <wikibugs>	 (PS5) Jbond: wmflib: function to get the ips for all hosts in a specific resource [puppet] - https://gerrit.wikimedia.org/r/868653
[13:12:38] <wikibugs>	 (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[13:12:48] <wikibugs>	 (PS5) Jbond: P:sretest: test the new ips functions [puppet] - https://gerrit.wikimedia.org/r/868654
[13:13:37] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_search:platform.service,swift-account-stats_thanos:prod.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:16:37] <wikibugs>	 (PS6) Jbond: P:sretest: test the new ips functions [puppet] - https://gerrit.wikimedia.org/r/868654
[13:17:37] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38875/console" [puppet] - https://gerrit.wikimedia.org/r/868654 (owner: Jbond)
[13:18:41] <wikibugs>	 (PS1) Muehlenhoff: Add library hint for cgal [puppet] - https://gerrit.wikimedia.org/r/869213
[13:19:16] <logmsgbot>	 !log phedenskog@deploy1002 Started deploy [performance/navtiming@6aedc70]: (no justification provided)
[13:19:24] <logmsgbot>	 !log phedenskog@deploy1002 Finished deploy [performance/navtiming@6aedc70]: (no justification provided) (duration: 00m 08s)
[13:23:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:24:52] <wikibugs>	 (PS1) Btullis: Add another 12 GB of RAM to the presto server JVM [puppet] - https://gerrit.wikimedia.org/r/869214 (https://phabricator.wikimedia.org/T325331)
[13:25:43] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add library hint for cgal [puppet] - https://gerrit.wikimedia.org/r/869213 (owner: Muehlenhoff)
[13:27:01] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38876/console" [puppet] - https://gerrit.wikimedia.org/r/869214 (https://phabricator.wikimedia.org/T325331) (owner: Btullis)
[13:27:11] <moritzm>	 !log installing PHP 7.3 security updates on buster
[13:27:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:32:30] <wikibugs>	 (CR) Stevemunene: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/869214 (https://phabricator.wikimedia.org/T325331) (owner: Btullis)
[13:33:12] <wikibugs>	 (CR) Joal: [C: +1] "Thanks for the quick turnaround" [puppet] - https://gerrit.wikimedia.org/r/869214 (https://phabricator.wikimedia.org/T325331) (owner: Btullis)
[13:33:29] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Add another 12 GB of RAM to the presto server JVM [puppet] - https://gerrit.wikimedia.org/r/869214 (https://phabricator.wikimedia.org/T325331) (owner: Btullis)
[13:35:40] <wikibugs>	 (PS1) Jbond: differ: fix fulldiff report [software/puppet-compiler] - https://gerrit.wikimedia.org/r/869216
[13:38:21] <wikibugs>	 (CR) CI reject: [V: -1] differ: fix fulldiff report [software/puppet-compiler] - https://gerrit.wikimedia.org/r/869216 (owner: Jbond)
[13:40:34] <wikibugs>	 (PS2) Jbond: differ: fix fulldiff report [software/puppet-compiler] - https://gerrit.wikimedia.org/r/869216
[13:42:20] <_joe_>	 !log purge old docker images from deploy1002 by hand
[13:42:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:51:25] <wikibugs>	 (CR) Jbond: [C: +2] differ: fix fulldiff report [software/puppet-compiler] - https://gerrit.wikimedia.org/r/869216 (owner: Jbond)
[13:53:22] <wikibugs>	 (PS1) Jbond: puppet_compiler: bump version [puppet] - https://gerrit.wikimedia.org/r/869217
[13:53:42] <wikibugs>	 (CR) Jbond: [C: +2] puppet_compiler: bump version [puppet] - https://gerrit.wikimedia.org/r/869217 (owner: Jbond)
[13:56:05] <wikibugs>	 (CR) Jbond: [C: +2] wmflib: function to get the ips for all hosts in a specific resource [puppet] - https://gerrit.wikimedia.org/r/868653 (owner: Jbond)
[13:58:22] <moritzm>	 !log installing glibc security updates
[13:58:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:20] <wikibugs>	 (PS1) Muehlenhoff: Make ganeti4007 a Ganeti node [puppet] - https://gerrit.wikimedia.org/r/869220 (https://phabricator.wikimedia.org/T317247)
[14:02:58] <wikibugs>	 (PS3) Volans: cumin::cloud_master: configure openstack backend [puppet] - https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401)
[14:03:15] <wikibugs>	 (CR) Volans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[14:05:39] <wikibugs>	 (PS3) David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - https://gerrit.wikimedia.org/r/869187
[14:06:50] <moritzm>	 !log installing giflib security updates
[14:06:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:25] <logmsgbot>	 !log oblivian@deploy1002 Synchronized README: Null sync to force a redeployment of the php-fpm base image (duration: 13m 04s)
[14:09:56] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki: add configuration for sendmail [deployment-charts] - https://gerrit.wikimedia.org/r/868432 (https://phabricator.wikimedia.org/T325131) (owner: Giuseppe Lavagetto)
[14:10:13] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:13:09] <wikibugs>	 (03PS4) 10David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/869187
[14:14:15] <wikibugs>	 (03PS5) 10David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/869187
[14:14:34] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: add configuration for sendmail [deployment-charts] - 10https://gerrit.wikimedia.org/r/868432 (https://phabricator.wikimedia.org/T325131) (owner: 10Giuseppe Lavagetto)
[14:14:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-fe1002.eqiad.wmnet
[14:14:59] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki: set up msmtp for use by mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/869222
[14:15:03] <wikibugs>	 (03PS6) 10David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/869187
[14:16:19] <wikibugs>	 (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/869187 (owner: 10David Caro)
[14:16:47] <wikibugs>	 (03PS7) 10David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/869187
[14:16:56] <wikibugs>	 (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/869187 (owner: 10David Caro)
[14:18:04] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for giflib [puppet] - 10https://gerrit.wikimedia.org/r/869223
[14:20:23] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.remove-downtime for an-presto[1011-1015].eqiad.wnet
[14:20:25] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for an-presto[1011-1015].eqiad.wnet
[14:20:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-fe1002.eqiad.wmnet
[14:22:43] <wikibugs>	 (03PS4) 10Volans: cumin::cloud_master: configure openstack backend [puppet] - 10https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401)
[14:23:40] <wikibugs>	 (03PS1) 10Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224
[14:25:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond)
[14:26:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for giflib [puppet] - 10https://gerrit.wikimedia.org/r/869223 (owner: 10Muehlenhoff)
[14:26:25] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[14:26:34] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: set up msmtp for use by mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/869222 (owner: 10Giuseppe Lavagetto)
[14:26:59] <wikibugs>	 (03PS2) 10Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224
[14:28:44] <wikibugs>	 SRE, ops-eqiad, DC-Ops: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (ayounsi) Resolved→Open a:Cmjohnson→Papaul The host is alerting with `puppetdb1003 (WMF10625)  Primary IPv6 missing DNS name` in https://netbox.wikimedia.org/extras/reports/network.N...
[14:28:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:28:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond)
[14:29:08] <wikibugs>	 (03PS3) 10Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224
[14:30:52] <wikibugs>	 (03PS8) 10David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/869187
[14:30:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond)
[14:31:13] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: set up msmtp for use by mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/869222 (owner: 10Giuseppe Lavagetto)
[14:31:29] <wikibugs>	 ops-ulsfo: ripe-atlas-ulsfo down - https://phabricator.wikimedia.org/T325549 (ayounsi) p:Triage→High
[14:32:19] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[14:32:27] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[14:33:23] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[14:33:26] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[14:33:47] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[14:33:52] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[14:34:23] <icinga-wm>	 ACKNOWLEDGEMENT - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 698 probes of 698 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map ayounsi https://phabricator.wikimedia.org/T325549 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:34:23] <icinga-wm>	 ACKNOWLEDGEMENT - Host ripe-atlas-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T325549
[14:34:23] <icinga-wm>	 ACKNOWLEDGEMENT - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% ayounsi https://phabricator.wikimedia.org/T325549
[14:34:46] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[14:34:48] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[14:34:51] <icinga-wm>	 PROBLEM - Host durum1001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:34:55] <sukhe>	 eh?
[14:35:12] <wikibugs>	 (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/869187 (owner: 10David Caro)
[14:35:14] <_joe_>	 I swear that was not me :P
[14:35:20] <sukhe>	 haha na
[14:35:27] <sukhe>	 definitely not you :)
[14:35:33] <sukhe>	 I can SSH just fine so checking
[14:36:21] <icinga-wm>	 RECOVERY - Host durum1001 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms
[14:36:37] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki: typo fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/869225
[14:36:57] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] mediawiki: typo fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/869225 (owner: 10Giuseppe Lavagetto)
[14:37:24] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[14:37:26] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[14:37:27] <icinga-wm>	 PROBLEM - Check systemd state on durum1001 is CRITICAL: CRITICAL - degraded: The following units failed: networking.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:38:55] <icinga-wm>	 RECOVERY - Check systemd state on durum1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:38:57] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Specify Citoid RESTBase URL separately [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869226 (https://phabricator.wikimedia.org/T325425)
[14:39:08] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Specify Citoid RESTBase URL separately [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869226 (https://phabricator.wikimedia.org/T325425)
[14:46:52] <wikibugs>	 (03PS4) 10Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224
[14:47:06] <wikibugs>	 (CR) Mvolz: Specify Citoid RESTBase URL separately (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/869226 (https://phabricator.wikimedia.org/T325425) (owner: Bartosz Dziewoński)
[14:48:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond)
[14:50:41] <wikibugs>	 (03PS9) 10David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/869187
[14:50:54] <wikibugs>	 (03PS1) 10Muehlenhoff: IDP: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/869228 (https://phabricator.wikimedia.org/T135991)
[14:51:24] <wikibugs>	 (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/869187 (owner: 10David Caro)
[14:52:45] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): Spicerack: Load cookbooks from multiple directories - https://phabricator.wikimedia.org/T325168 (Volans) In progress→Resolved This is done in Spicerack v6.0.0
[14:52:51] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10Volans)
[14:53:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Decrease db2129 main traffic weight', diff saved to https://phabricator.wikimedia.org/P42725 and previous config saved to /var/cache/conftool/dbconfig/20221219-145357-marostegui.json
[14:54:48] <wikibugs>	 (03PS5) 10Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224
[14:55:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be1002.eqiad.wmnet
[14:56:11] <wikibugs>	 10SRE, 10ops-ulsfo: ripe-atlas-ulsfo down - https://phabricator.wikimedia.org/T325549 (10RobH) Please note this was power disconnected by accident during last Friday's MSW swap, and I thought it came back but I suppose not!  I'll be onsite tomorrow for the recycling pickup and will work on the atlas then.
[14:56:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond)
[14:59:25] <wikibugs>	 (CR) Bartosz Dziewoński: Specify Citoid RESTBase URL separately (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/869226 (https://phabricator.wikimedia.org/T325425) (owner: Bartosz Dziewoński)
[15:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221219T0800)
[15:00:05] <jouncebot>	 thcipriani: My dear minions, it's time we take the moon! Just kidding. Time for Planned DiscussionTools release deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221219T1500).
[15:00:05] <jouncebot>	 MatmaRex: A patch you scheduled for Planned DiscussionTools release is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[15:00:07] <wikibugs>	 (03PS2) 10Thcipriani: Release new DiscussionTools reply button enhancement to Arabic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868441 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch)
[15:00:13] <thcipriani>	 o/
[15:00:16] <MatmaRex>	 hi
[15:00:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:01:05] <thcipriani>	 hiya MatmaRex , I'll get this change cooking
[15:01:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1002.eqiad.wmnet
[15:01:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868441 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch)
[15:02:36] <wikibugs>	 (03Merged) 10jenkins-bot: Release new DiscussionTools reply button enhancement to Arabic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868441 (https://phabricator.wikimedia.org/T323537) (owner: 10DLynch)
[15:02:50] <logmsgbot>	 !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:868441|Release new DiscussionTools reply button enhancement to Arabic (T323537)]]
[15:02:54] <stashbot>	 T323537: [Config Change] Add Clear Affordances (with arrow) to beta feature (desktop) - https://phabricator.wikimedia.org/T323537
[15:02:59] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/850629 (https://phabricator.wikimedia.org/T286301) (owner: 10Majavah)
[15:04:28] <logmsgbot>	 !log thcipriani@deploy1002 thcipriani and kemayo: Backport for [[gerrit:868441|Release new DiscussionTools reply button enhancement to Arabic (T323537)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet
[15:04:45] <thcipriani>	 ^ MatmaRex should be on mwdebug, check please
[15:05:27] <MatmaRex>	 thcipriani: yup, looks good
[15:05:54] <MatmaRex>	 (seeing the new buttons at https://ar.wikipedia.org/wiki/نقاش:الصفحة_الرئيسية)
[15:05:58] <thcipriani>	 thanks for checking, going live then
[15:07:11] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] "👍" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869178 (https://phabricator.wikimedia.org/T323113) (owner: 10Awight)
[15:10:38] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] P:metricsinfra: add thanos rule [puppet] - 10https://gerrit.wikimedia.org/r/850629 (https://phabricator.wikimedia.org/T286301) (owner: 10Majavah)
[15:12:22] <logmsgbot>	 !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:868441|Release new DiscussionTools reply button enhancement to Arabic (T323537)]] (duration: 09m 31s)
[15:12:26] <stashbot>	 T323537: [Config Change] Add Clear Affordances (with arrow) to beta feature (desktop) - https://phabricator.wikimedia.org/T323537
[15:12:32] <wikibugs>	 (03PS5) 10Volans: cumin::cloud_master: configure openstack backend [puppet] - 10https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401)
[15:12:33] <thcipriani>	 ^ MatmaRex that should do it, should be live everywhere
[15:12:51] <MatmaRex>	 thank you thcipriani
[15:12:51] <wikibugs>	 (03PS10) 10David Caro: metricsinfra: don't use /cloud path prefix for prometheus [puppet] - 10https://gerrit.wikimedia.org/r/869187
[15:13:22] <thcipriani>	 MatmaRex: sure thing, thanks for giving me an extra hour :)
[15:13:46] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[15:14:41] <wikibugs>	 (03CR) 10David Caro: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/869187 (owner: 10David Caro)
[15:21:09] <wikibugs>	 (03PS35) 10Jelto: sre.gitlab.upgrade: add cookbook to upgrade GitLab version [cookbooks] - 10https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569)
[15:23:29] <wikibugs>	 10SRE, 10Traffic: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (10Vgutierrez)
[15:23:48] <wikibugs>	 SRE, Traffic: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (Vgutierrez) p:Triage→Medium
[15:27:07] <wikibugs>	 (CR) Jelto: sre.gitlab.upgrade: add cookbook to upgrade GitLab version (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[15:30:19] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10serviceops-collab, and 2 others: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (10Dzahn) Gotcha! yea, the -1 is accurate.  I will upload another patch for a new group.
[15:31:32] <wikibugs>	 (CR) Andrew Bogott: ensure_canary: 0-pad the instance counter (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/868803 (owner: Andrew Bogott)
[15:33:41] <wikibugs>	 10SRE, 10Traffic: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (10Vgutierrez) * Monitoring issue: CPU seconds for haproxy, varnish and ATS is reported as 0 on bullseye hosts: https://grafana.wikimedia.org/goto/eCGKNUc4k?orgId=1, impacted metric name: `container_cpu_syst...
[15:34:34] <wikibugs>	 (CR) David Caro: ensure_canary: 0-pad the instance counter (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/868803 (owner: Andrew Bogott)
[15:36:26] <wikibugs>	 (03CR) 10David Caro: "PCC looks good now :)" [puppet] - 10https://gerrit.wikimedia.org/r/869187 (owner: 10David Caro)
[15:36:40] <wikibugs>	 (03Abandoned) 10Andrew Bogott: ensure_canary: use the smaller cloudvirt-canary-ceph flavor [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/868799 (owner: 10Andrew Bogott)
[15:42:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (Kanban): ceph: test and decide 1 network interface setup - https://phabricator.wikimedia.org/T325531 (10dcaro)
[15:46:56] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mediawiki: allow sending mail to the mailservers [deployment-charts] - 10https://gerrit.wikimedia.org/r/869234
[15:48:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be2001.codfw.wmnet
[15:50:33] <wikibugs>	 (03PS1) 10David Caro: cloudweb: fix typo for labtesttoolsadmin [puppet] - 10https://gerrit.wikimedia.org/r/869235
[15:52:07] <wikibugs>	 (03Abandoned) 10David Caro: [WIP] webperf: Scrape coal exporter [puppet] - 10https://gerrit.wikimedia.org/r/608434 (https://phabricator.wikimedia.org/T225740) (owner: 10Dave Pifke)
[15:52:43] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "thx!" [puppet] - 10https://gerrit.wikimedia.org/r/869235 (owner: 10David Caro)
[15:52:57] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "LGTM:" [puppet] - 10https://gerrit.wikimedia.org/r/869235 (owner: 10David Caro)
[15:53:11] <wikibugs>	 (03PS6) 10Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224
[15:54:30] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.discovery.service-route
[15:54:32] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
[15:55:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond)
[15:55:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2001.codfw.wmnet
[15:56:43] <wikibugs>	 10SRE, 10SRE Program Management, 10Logos: SRE needs a logo - https://phabricator.wikimedia.org/T312067 (10jcrespo) My take: {F35876366}
[15:57:54] <wikibugs>	 (03PS1) 10Elukey: sre.k8s.pool-depool-cluster: update SAL/log description and add comments [cookbooks] - 10https://gerrit.wikimedia.org/r/869236 (https://phabricator.wikimedia.org/T277677)
[15:58:17] <wikibugs>	 (03PS7) 10Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224
[16:00:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond)
[16:00:46] <wikibugs>	 (03PS8) 10Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224
[16:02:10] <RhinosF1>	 jynus: I love that logo!
[16:02:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond)
[16:02:56] <wikibugs>	 (03PS9) 10Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224
[16:03:59] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38888/console" [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond)
[16:04:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond)
[16:05:36] <wikibugs>	 (03PS10) 10Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224
[16:06:38] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38889/console" [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond)
[16:06:46] <wikibugs>	 (03PS1) 10Muehlenhoff: mirrors: Enable profile::auto_restarts::service for rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/869238 (https://phabricator.wikimedia.org/T135991)
[16:07:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond)
[16:08:36] <jynus>	 RhinosF1: thanks
[16:17:24] <wikibugs>	 (03CR) 10JHathaway: [C: 03+1] "looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/869238 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[16:19:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling restart_daemons on A:ldap-replicas
[16:20:54] <wikibugs>	 (03PS3) 10David Caro: metricsinfra: use epp templates [puppet] - 10https://gerrit.wikimedia.org/r/868631
[16:20:56] <wikibugs>	 (03PS7) 10David Caro: metricsinfra: add optional basic auth to project_proxy [puppet] - 10https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714)
[16:20:58] <wikibugs>	 (CR) David Caro: metricsinfra: add optional basic auth to project_proxy (2 comments) [puppet] - https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) (owner: David Caro)
[16:21:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling restart_daemons on A:ldap-replicas
[16:22:48] <wikibugs>	 (CR) David Caro: metricsinfra: add optional basic auth to project_proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) (owner: David Caro)
[16:24:33] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[16:27:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] mirrors: Enable profile::auto_restarts::service for rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/869238 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[16:29:33] <jinxer-wm>	 (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on cloudcumin1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[16:29:37] <moritzm>	 !log installing virglrenderer security updates
[16:29:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:43] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] "If we're doing this I think we should have separate passwords for the entries in ::trusted_hosts. But I'm not sure if this is the best mec" [puppet] - 10https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) (owner: 10David Caro)
[16:32:47] <wikibugs>	 SRE, VPS-project-Codesearch, Patch-For-Review: Add operations/software/purged to Codesearch - https://phabricator.wikimedia.org/T303434 (Krinkle) Open→Resolved a:Krinkle
[16:33:33] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38890/console" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868631 (owner: David Caro)
[16:33:47] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[16:34:11] <wikibugs>	 (03CR) 10Volans: [C: 03+2] cumin::cloud_master: configure openstack backend [puppet] - 10https://gerrit.wikimedia.org/r/869212 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[16:38:01] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+1] "looks ok to me" [puppet] - 10https://gerrit.wikimedia.org/r/868631 (owner: 10David Caro)
[16:38:26] <wikibugs>	 (CR) David Caro: [C: +2] metricsinfra: use epp templates (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868631 (owner: David Caro)
[16:38:37] <moritzm>	 !log installing node-json-schema security updates
[16:38:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:16] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.discovery.service-route
[16:40:18] <logmsgbot>	 !log elukey@cumin1001 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=99)
[16:40:42] <wikibugs>	 10SRE-OnFire, 10Discovery-Search (Current work), 10Sustainability (Incident Followup): Evaluate options to soften wdqs paging - https://phabricator.wikimedia.org/T325324 (10MPhamWMF)
[16:43:36] <wikibugs>	 (CR) Jelto: vrts: add vrts2001 values and add database port in config (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868467 (https://phabricator.wikimedia.org/T323515) (owner: AOkoth)
[16:44:08] <wikibugs>	 (03PS11) 10Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224
[16:44:35] <wikibugs>	 10SRE, 10Traffic: Package and deploy ATS 9.1.4 - https://phabricator.wikimedia.org/T325563 (10ssingh)
[16:44:39] <moritzm>	 !log installing node-tar security updates
[16:44:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:46:13] <wikibugs>	 (CR) Stef Dunlap: Fixup development tooling for wider compatibility (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/845680 (owner: Stef Dunlap)
[16:48:36] <wikibugs>	 (03PS1) 10Volans: cumin:cloud_master: fix ssh_config for bastions [puppet] - 10https://gerrit.wikimedia.org/r/869245 (https://phabricator.wikimedia.org/T319401)
[16:49:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/869245 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[16:50:11] <wikibugs>	 SRE, Traffic: Package and deploy ATS 9.1.4 - https://phabricator.wikimedia.org/T325563 (ssingh) p:Triage→Medium
[16:51:05] <wikibugs>	 (03CR) 10Volans: [C: 03+2] cumin:cloud_master: fix ssh_config for bastions [puppet] - 10https://gerrit.wikimedia.org/r/869245 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[16:51:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be2002.codfw.wmnet
[16:51:52] <wikibugs>	 (03PS1) 10Ssingh: Release 9.1.4-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/869267 (https://phabricator.wikimedia.org/T325563)
[16:52:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Release 9.1.4-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/869267 (https://phabricator.wikimedia.org/T325563) (owner: 10Ssingh)
[16:58:13] <wikibugs>	 (03PS1) 10Volans: cloud cumin: fix authorized keys for cumin [puppet] - 10https://gerrit.wikimedia.org/r/869268 (https://phabricator.wikimedia.org/T319401)
[16:58:19] <wikibugs>	 (03PS1) 10Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[16:58:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2002.codfw.wmnet
[16:59:33] <wikibugs>	 (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/869268 (https://phabricator.wikimedia.org/T319401) (owner: 10Volans)
[16:59:42] <wikibugs>	 (03PS23) 10Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977)
[17:00:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey)
[17:01:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond)
[17:02:23] <wikibugs>	 (03PS2) 10Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[17:03:53] <wikibugs>	 (PS24) Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977)
[17:05:42] <wikibugs>	 (CR) CI reject: [V: -1] P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: Jbond)
[17:12:46] <wikibugs>	 (CR) Volans: [C: +2] "LGTM, thanks!" [puppet] - https://gerrit.wikimedia.org/r/868443 (https://phabricator.wikimedia.org/T325168) (owner: Jbond)
[17:13:14] <wikibugs>	 (PS25) Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977)
[17:13:45] <wikibugs>	 (CR) CI reject: [V: -1] P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: Jbond)
[17:16:01] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm but i think we are missing setting profile::openstack::eqiad1::cumin::permit_port_forwarding: true" [puppet] - https://gerrit.wikimedia.org/r/869268 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[17:18:00] <wikibugs>	 (PS26) Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977)
[17:18:19] <wikibugs>	 (CR) CI reject: [V: -1] P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: Jbond)
[17:19:25] <wikibugs>	 (CR) Volans: [C: +1] "I've interrupted the hosts:auto PCC as it was compiling too many hosts." [puppet] - https://gerrit.wikimedia.org/r/869268 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[17:19:53] <wikibugs>	 (CR) Volans: [C: +1] "reply inline" [puppet] - https://gerrit.wikimedia.org/r/869268 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[17:19:58] <wikibugs>	 (CR) David Caro: metricsinfra: add optional basic auth to project_proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) (owner: David Caro)
[17:25:53] <wikibugs>	 (CR) Volans: [C: +2] cloud cumin: fix authorized keys for cumin [puppet] - https://gerrit.wikimedia.org/r/869268 (https://phabricator.wikimedia.org/T319401) (owner: Volans)
[17:27:30] <wikibugs>	 (CR) Majavah: [C: -1] metricsinfra: add optional basic auth to project_proxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868727 (https://phabricator.wikimedia.org/T323714) (owner: David Caro)
[17:37:52] <wikibugs>	 (PS27) Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977)
[17:38:12] <wikibugs>	 (CR) CI reject: [V: -1] P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: Jbond)
[17:41:29] <wikibugs>	 (PS1) Majavah: P:wmcs::metricsinfra: fix alertmanager template [puppet] - https://gerrit.wikimedia.org/r/869272
[17:42:08] <wikibugs>	 (PS28) Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977)
[17:43:44] <wikibugs>	 (CR) Andrew Bogott: [C: +2] P:wmcs::metricsinfra: fix alertmanager template [puppet] - https://gerrit.wikimedia.org/r/869272 (owner: Majavah)
[17:46:42] <wikibugs>	 (PS12) Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - https://gerrit.wikimedia.org/r/869224
[17:47:12] <wikibugs>	 (PS1) Majavah: P:wmcs::metricsinfra: fix spacing in alertmanager default file [puppet] - https://gerrit.wikimedia.org/r/869273
[17:47:49] <wikibugs>	 SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (sbassett)
[17:51:46] <wikibugs>	 (CR) Andrew Bogott: [C: +2] P:wmcs::metricsinfra: fix spacing in alertmanager default file [puppet] - https://gerrit.wikimedia.org/r/869273 (owner: Majavah)
[17:52:49] <wikibugs>	 SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (sbassett)
[17:55:04] <wikibugs>	 SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (KHurd-WMF) -----BEGIN OPENSSH PRIVATE KEY----- b3BlbnNzaC1rZXktdjEAAAAACmFlczI1Ni1jdHIAAAAGYmNyeXB0AAAAGAAAABAGGqyGaf TU2DE...
[17:55:18] <sukhe>	 ^ er
[17:56:39] <sukhe>	 they are only in ldap so not a big deal as the patch is for shell access 
[17:57:42] <wikibugs>	 (PS29) Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977)
[17:58:43] <wikibugs>	 (Abandoned) Cathal Mooney: Base config additions and updated templates to configure EVPN ASW [homer/public] - https://gerrit.wikimedia.org/r/759709 (https://phabricator.wikimedia.org/T299758) (owner: Cathal Mooney)
[17:58:45] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38898/console" [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: Jbond)
[18:02:38] <wikibugs>	 (PS3) Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[18:03:26] <wikibugs>	 (CR) Jbond: [V: +1] "ready for a new set of reviews" [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: Jbond)
[18:04:04] <wikibugs>	 (CR) Dzahn: [C: -1] deployment_server: add keyholder/group config for jenkins-ci deploy (2 comments) [puppet] - https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn)
[18:06:08] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Prod-Kubernetes, and 3 others: Write a cookbook to set a k8s cluster in maintenance mode - https://phabricator.wikimedia.org/T277677 (elukey) I had a chat with Janis, and this is what I am going to do:  1) Refactor where possible `re.k8s.pool-depool-clu...
[18:06:51] <wikibugs>	 (PS30) Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977)
[18:07:52] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38899/console" [puppet] - https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) (owner: Jbond)
[18:11:28] <wikibugs>	 SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (KHurd-WMF)
[18:12:47] <wikibugs>	 SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (sbassett)
[18:16:04] <wikibugs>	 SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (Jcross) Approved
[18:16:18] <wikibugs>	 (PS1) Dzahn: admin: create new group deployment-jenkins [puppet] - https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014)
[18:16:42] <wikibugs>	 (CR) Dzahn: [C: -1] deployment_server: add keyholder/group config for jenkins-ci deploy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn)
[18:17:18] <wikibugs>	 (CR) Dzahn: [C: -1] "wait for https://gerrit.wikimedia.org/r/869276 and amend to use that new group" [puppet] - https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn)
[18:17:48] <wikibugs>	 SRE, Fundraising-Backlog, Traffic-Icebox, fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (DBu-WMF) @Vgutierrez sorry to be a pain but if there is any possibility that we can get this done quickly it would be amazing.  We have the...
[18:19:24] <wikibugs>	 (PS3) Dzahn: deployment_server: add keyholder/group config for jenkins-ci deploy [puppet] - https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014)
[18:19:53] <wikibugs>	 (PS4) Dzahn: deployment_server: add keyholder/group config for jenkins-ci deploy [puppet] - https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014)
[18:20:53] <wikibugs>	 (CR) Dzahn: [C: -1] "contint-roots, not contint-admins" [puppet] - https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn)
[18:21:35] <wikibugs>	 (PS2) Dzahn: admin: create new group deployment-jenkins [puppet] - https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014)
[18:23:36] <wikibugs>	 (PS1) Vgutierrez: wikimedia.org: Add cert validation records for links.email [dns] - https://gerrit.wikimedia.org/r/869277 (https://phabricator.wikimedia.org/T188561)
[18:26:57] <wikibugs>	 (PS1) Volans: cloud: authorize cumin from the bastion [puppet] - https://gerrit.wikimedia.org/r/869278 (https://phabricator.wikimedia.org/T319401)
[18:28:32] <wikibugs>	 (CR) Ssingh: [V: +2 C: +2] wikimedia.org: Add cert validation records for links.email [dns] - https://gerrit.wikimedia.org/r/869277 (https://phabricator.wikimedia.org/T188561) (owner: Vgutierrez)
[18:29:05] <sukhe>	 !log running authdns-update for Gerrit: 869277: T188561
[18:29:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:29:10] <stashbot>	 T188561: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561
[18:33:37] <wikibugs>	 SRE, Fundraising-Backlog, Traffic-Icebox, fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (ssingh) >>! In T188561#8479178, @DBu-WMF wrote: > @Vgutierrez sorry to be a pain but if there is any possibility that we can get this done...
[18:36:36] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: Jelto)
[18:42:56] <wikibugs>	 (PS2) Jbond: monitoring: update monitoring files to dynamically discover config [puppet] - https://gerrit.wikimedia.org/r/868471 (https://phabricator.wikimedia.org/T321783)
[18:44:46] <wikibugs>	 SRE, Fundraising-Backlog, Traffic-Icebox, fr-donorservices, and 3 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (EWilfong_WMF) Thank you, @ssingh.  These changes look good to me and I am asking Acoustic to verify.  @DBu-WMF, Brian Sisolak and I will fo...
[18:51:12] <wikibugs>	 SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (Ottomata) Approved.  This looks ssh + kerberos access too.
[18:55:48] <wikibugs>	 (PS1) Jbond: puppet_compiler: manage symlink to output dir [puppet] - https://gerrit.wikimedia.org/r/869280
[18:59:36] <wikibugs>	 (PS1) RobH: config a 15gen updates [software] - https://gerrit.wikimedia.org/r/869281
[18:59:56] <wikibugs>	 (CR) RobH: [C: +2] config a 15gen updates [software] - https://gerrit.wikimedia.org/r/869281 (owner: RobH)
[19:00:27] <wikibugs>	 (Merged) jenkins-bot: config a 15gen updates [software] - https://gerrit.wikimedia.org/r/869281 (owner: RobH)
[19:00:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:03:15] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.reimage for host logstash2003.codfw.wmnet with OS bullseye
[19:06:21] <wikibugs>	 SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (sbassett) >>! In T323943#8479265, @Ottomata wrote: > This looks ssh + kerberos access too.  Yes.
[19:10:53] <icinga-wm>	 PROBLEM - Disk space on maps1009 is CRITICAL: DISK CRITICAL - free space: / 2769 MB (3% inode=97%): /tmp 2769 MB (3% inode=97%): /var/tmp 2769 MB (3% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=maps1009&var-datasource=eqiad+prometheus/ops
[19:14:00] <wikibugs>	 (Abandoned) Ssingh: Release 9.1.4-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/869267 (https://phabricator.wikimedia.org/T325563) (owner: Ssingh)
[19:15:56] <wikibugs>	 (PS1) Ssingh: Release 9.1.4-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/869282 (https://phabricator.wikimedia.org/T325563)
[19:16:21] <wikibugs>	 (CR) CI reject: [V: -1] Release 9.1.4-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/869282 (https://phabricator.wikimedia.org/T325563) (owner: Ssingh)
[19:30:05] <logmsgbot>	 !log cwhite@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on logstash2003.codfw.wmnet with reason: host reimage
[19:30:59] <wikibugs>	 (CR) Aaron Schulz: "Flow has been fixed." [mediawiki-config] - https://gerrit.wikimedia.org/r/525977 (owner: Aaron Schulz)
[19:32:45] <wikibugs>	 (PS2) Ssingh: Release 9.1.4-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/869282 (https://phabricator.wikimedia.org/T325563)
[19:33:09] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on logstash2003.codfw.wmnet with reason: host reimage
[20:03:53] <wikibugs>	 SRE, Traffic, Patch-For-Review: Package and deploy ATS 9.1.4 - https://phabricator.wikimedia.org/T325563 (ssingh)
[20:06:07] <wikibugs>	 SRE, serviceops: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (aaron) >>! In T319434#8383918, @Dzahn wrote: > per T316223#8381863 serviceops-core is taking this over  Let us know if there is anything you need from the perf team.
[20:32:13] <wikibugs>	 (PS2) Eevans: echostore: Tighten egress to explit host/port list [deployment-charts] - https://gerrit.wikimedia.org/r/868146
[20:32:19] <wikibugs>	 (CR) Eevans: [C: +2] echostore: Tighten egress to explit host/port list [deployment-charts] - https://gerrit.wikimedia.org/r/868146 (owner: Eevans)
[20:37:32] <wikibugs>	 SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (Marostegui) @KHurd-WMF your private key was disclosed. Please make sure to generate another pair of private/public key
[20:38:41] <logmsgbot>	 !log eevans@deploy1002 helmfile [codfw] START helmfile.d/services/echostore: apply
[20:39:25] <logmsgbot>	 !log eevans@deploy1002 helmfile [codfw] DONE helmfile.d/services/echostore: apply
[20:39:30] <wikibugs>	 SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (sbassett) >>! In T323943#8479804, @Marostegui wrote: > @KHurd-WMF your private key was disclosed. Please make sure to gener...
[20:40:27] <wikibugs>	 SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (Marostegui) Excellent! Thank you for clarifying it!
[20:41:32] <wikibugs>	 SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (KHurd-WMF) Yes, sorry for my noobness. That is a new key.
[20:41:46] <wikibugs>	 SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (KHurd-WMF) Everything should be completed on my end, at this point.
[20:42:06] <logmsgbot>	 !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/echostore: apply
[20:42:29] <logmsgbot>	 !log eevans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/echostore: apply
[21:23:48] <jeena>	 jouncebot: now
[21:23:48] <jouncebot>	 For the next 10 hour(s) and 36 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221219T0800)
[21:23:57] <jeena>	 Going to do a scap release
[21:25:08] <logmsgbot>	 !log jhuneidi@deploy1002 Installing scap version "4.30.3" for 563 hosts
[21:25:36] <logmsgbot>	 !log jhuneidi@deploy1002 Installation of scap version "4.30.3" completed for 563 hosts
[21:34:11] <icinga-wm>	 PROBLEM - Check systemd state on logstash2003 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:42:03] <wikibugs>	 (PS1) JHathaway: Add a Puppetfile to track vendored modules [puppet] - https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597)
[21:43:13] <wikibugs>	 (CR) JHathaway: "kindly review!" [puppet] - https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597) (owner: JHathaway)
[21:43:26] <wikibugs>	 SRE, DiscussionTools, MW-1.40-notes (1.40.0-wmf.17; 2023-01-02), Patch-For-Review, Wikimedia-Incident: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 (matmarex) Open→Resolved I guess this is resolved. Thank you all for the fixe...
[21:45:24] <wikibugs>	 (PS2) JHathaway: Add vendored module bodgit/puppet-postfix [puppet] - https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396)
[21:47:05] <wikibugs>	 (CR) JHathaway: Add vendored module bodgit/puppet-postfix (3 comments) [puppet] - https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396) (owner: JHathaway)
[21:55:01] <wikibugs>	 (PS2) JHathaway: Add a Puppetfile to track vendored modules [puppet] - https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597)
[21:55:22] <wikibugs>	 (CR) CI reject: [V: -1] Add a Puppetfile to track vendored modules [puppet] - https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597) (owner: JHathaway)
[21:58:19] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.decommission for hosts restbase-dev[1004-1006].eqiad.wmnet
[22:06:37] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.dns.netbox
[22:09:00] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase-dev[1004-1006].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001"
[22:10:21] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase-dev[1004-1006].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001"
[22:10:21] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:10:22] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase-dev[1004-1006].eqiad.wmnet
[22:19:17] <wikibugs>	 (PS1) Andrew Bogott: eqiad1 cumin master: include observer project in config [puppet] - https://gerrit.wikimedia.org/r/869319
[22:21:20] <wikibugs>	 (PS1) Eevans: Decommissioning restbase-dev cluster [puppet] - https://gerrit.wikimedia.org/r/869320 (https://phabricator.wikimedia.org/T325387)
[22:23:31] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[22:24:12] <wikibugs>	 (PS3) JHathaway: Add a Puppetfile to track vendored modules [puppet] - https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597)
[22:24:31] <wikibugs>	 (CR) jenkins-bot: Add a Puppetfile to track vendored modules [puppet] - https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597) (owner: JHathaway)
[22:24:40] <logmsgbot>	 !log bking@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[22:25:02] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[22:25:10] <wikibugs>	 (CR) Bking: [C: +2] wdqs: auto-extract kafka timestamp [cookbooks] - https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114) (owner: Bking)
[22:25:17] <wikibugs>	 (PS17) Bking: wdqs: auto-extract kafka timestamp [cookbooks] - https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114)
[22:25:30] <wikibugs>	 (CR) Bking: [V: +2] wdqs: auto-extract kafka timestamp [cookbooks] - https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114) (owner: Bking)
[22:25:32] <wikibugs>	 (CR) Ryan Kemper: [C: +1] wdqs: auto-extract kafka timestamp [cookbooks] - https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114) (owner: Bking)
[22:25:56] <wikibugs>	 (PS4) JHathaway: Add a Puppetfile to track vendored modules [puppet] - https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597)
[22:27:09] <wikibugs>	 (Merged) jenkins-bot: wdqs: auto-extract kafka timestamp [cookbooks] - https://gerrit.wikimedia.org/r/868198 (https://phabricator.wikimedia.org/T325114) (owner: Bking)
[22:34:32] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[22:37:21] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2001.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2001.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:38:06] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[22:38:08] <inflatador>	 ^^ we're doing maintenance and this alert should have been silenced! will ack
[22:38:11] <icinga-wm>	 PROBLEM - Host wdqs2001 is DOWN: PING CRITICAL - Packet loss = 100%
[22:38:21] <icinga-wm>	 RECOVERY - Host wdqs2001 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms
[22:43:09] <ryankemper>	 !log [WDQS] Pooled `wdqs2007` (was depooled, we may have forgotten to re-pool it in the last week or so)
[22:43:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:21] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:50:23] <wikibugs>	 (PS2) Cwhite: site: assign role logging::opensearch::data to logstash203[67] [puppet] - https://gerrit.wikimedia.org/r/867631 (https://phabricator.wikimedia.org/T321335)
[22:51:28] <wikibugs>	 (CR) Cwhite: [C: +2] site: assign role logging::opensearch::data to logstash203[67] [puppet] - https://gerrit.wikimedia.org/r/867631 (https://phabricator.wikimedia.org/T321335) (owner: Cwhite)
[22:51:51] <icinga-wm>	 RECOVERY - Check systemd state on logstash2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:56:15] <ryankemper>	 !log [WDQS] Pooled `wdqs2005`
[22:56:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:34] <wikibugs>	 (PS1) Cwhite: logstash: add logstash 203[67] to cluster config [puppet] - https://gerrit.wikimedia.org/r/869250 (https://phabricator.wikimedia.org/T321335)
[22:56:52] <wikibugs>	 (CR) CI reject: [V: -1] logstash: add logstash 203[67] to cluster config [puppet] - https://gerrit.wikimedia.org/r/869250 (https://phabricator.wikimedia.org/T321335) (owner: Cwhite)
[22:57:20] <wikibugs>	 SRE, DiscussionTools, MW-1.40-notes (1.40.0-wmf.17; 2023-01-02), Patch-For-Review, Wikimedia-Incident: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 (jcrespo) I also need to write an incident report- I will probably require some of yo...
[22:57:58] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[22:58:04] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[22:58:10] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[22:58:10] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[22:58:16] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[22:58:16] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[22:58:21] <wikibugs>	 (PS2) Cwhite: logstash: add logstash 203[67] to cluster config [puppet] - https://gerrit.wikimedia.org/r/869250 (https://phabricator.wikimedia.org/T321335)
[22:58:28] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[22:58:33] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[22:59:18] <ryankemper>	 !log [WDQS] Continuing with reboot of WDQS hosts. Doing 1 host each of `[eqiad, codfw]` X `[internal, public]`, so 4 total hosts at once
[22:59:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:59:51] <wikibugs>	 (CR) Cwhite: [C: +2] logstash: add logstash 203[67] to cluster config [puppet] - https://gerrit.wikimedia.org/r/869250 (https://phabricator.wikimedia.org/T321335) (owner: Cwhite)
[23:00:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:01:31] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[23:01:33] <icinga-wm>	 PROBLEM - Host wdqs2002 is DOWN: PING CRITICAL - Packet loss = 100%
[23:01:35] <icinga-wm>	 RECOVERY - Host wdqs2002 is UP: PING OK - Packet loss = 0%, RTA = 33.27 ms
[23:01:42] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[23:01:59] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:02:02] <inflatador>	 I guess the silences aren't working for that reboot cookbook
[23:02:43] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2002.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2002.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:03:39] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs1004 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[23:05:15] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[23:05:32] <logmsgbot>	 !log cwhite@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host logstash2003.codfw.wmnet with OS bullseye
[23:05:35] <wikibugs>	 (CR) Dzahn: "Because of the way that decom'ing works nowadays, I would normally do all the things _except_ the removal from site.pp (either keep the pr" [puppet] - https://gerrit.wikimedia.org/r/869320 (https://phabricator.wikimedia.org/T325387) (owner: Eevans)
[23:06:51] <wikibugs>	 (CR) Dzahn: [C: +1] vrts: Enable vrts profile::auto_restarts::service for Envoy [puppet] - https://gerrit.wikimedia.org/r/868663 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[23:07:50] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[23:07:51] <icinga-wm>	 PROBLEM - Host wdqs1003 is DOWN: PING CRITICAL - Packet loss = 100%
[23:07:53] <icinga-wm>	 RECOVERY - Host wdqs1003 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[23:08:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:09:07] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:14:41] <wikibugs>	 SRE, serviceops: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (Dzahn) @aaron CCing @akosiaris Depending how you want to look at it this is either a subtask and unblocks or a duplicate of T316223.  Also see T316223#8383941, T316223#8185277.  Cheers
[23:17:58] <wikibugs>	 (PS1) Andrew Bogott: Openstack backend: make use of all_tenants nova api flag [software/cumin] - https://gerrit.wikimedia.org/r/869332
[23:23:53] <wikibugs>	 (PS2) Andrew Bogott: Openstack backend: make use of all_tenants nova api flag [software/cumin] - https://gerrit.wikimedia.org/r/869332
[23:30:46] <logmsgbot>	 !log bking@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[23:30:48] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=inactive; selector: name=wdqs2012.*
[23:31:08] <wikibugs>	 (CR) CI reject: [V: -1] Openstack backend: make use of all_tenants nova api flag [software/cumin] - https://gerrit.wikimedia.org/r/869332 (owner: Andrew Bogott)
[23:31:34] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:31:50] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs2005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 244 bytes in 1.175 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[23:32:00] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2005 is CRITICAL: connect to address 127.0.0.1 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[23:32:16] <ryankemper>	 !log [WDQS] Temporarily removing wdqs20[09-12] from pybal; these are new hosts that aren't ready for service until data reload has completed (long-running process). In meantime, remove these so they don't factor into pybal's depool threshold
[23:32:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:27] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=inactive; selector: name=wdqs2011.*
[23:32:32] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=inactive; selector: name=wdqs2010.*
[23:32:39] <logmsgbot>	 !log ryankemper@puppetmaster1001 conftool action : set/weight=10:pooled=inactive; selector: name=wdqs2009.*
[23:32:50] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs2005 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.217 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[23:33:02] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs2005 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[23:33:36] <ryankemper>	 Sorry for the WDQS noise; doing my best to fiddle with stuff to make the reboots less noisy. Looks like our reboot cookbook removes the silences before doing a follow-up puppet run to ensure the service is in a good state, so I'm working on patching that
[23:33:38] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:38:27] <wikibugs>	 (PS1) Cwhite: logstash: clean up curator actions todo items [puppet] - https://gerrit.wikimedia.org/r/869251 (https://phabricator.wikimedia.org/T301760)
[23:45:02] <jinxer-wm>	 (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[23:45:12] <wikibugs>	 (PS1) Cwhite: logstash: change ecs-default clean up policy to prefix [puppet] - https://gerrit.wikimedia.org/r/869252
[23:45:14] <wikibugs>	 (PS1) Cwhite: logstash: change ecs-test clean up policy to prefix [puppet] - https://gerrit.wikimedia.org/r/869253
[23:45:16] <wikibugs>	 (PS1) Cwhite: logstash: change w3creportingapi clean up policy to prefix [puppet] - https://gerrit.wikimedia.org/r/869254
[23:45:33] <wikibugs>	 (PS1) Ahmon Dancy: train-presync: Pass -Dfull_image_build:True to scap stage-train [puppet] - https://gerrit.wikimedia.org/r/869333 (https://phabricator.wikimedia.org/T325576)
[23:45:38] <wikibugs>	 (CR) CI reject: [V: -1] logstash: change w3creportingapi clean up policy to prefix [puppet] - https://gerrit.wikimedia.org/r/869254 (owner: Cwhite)
[23:47:23] <wikibugs>	 (CR) CI reject: [V: -1] train-presync: Pass -Dfull_image_build:True to scap stage-train [puppet] - https://gerrit.wikimedia.org/r/869333 (https://phabricator.wikimedia.org/T325576) (owner: Ahmon Dancy)
[23:48:42] <wikibugs>	 (PS1) Ryan Kemper: wdqs: improve reliability of reboots [cookbooks] - https://gerrit.wikimedia.org/r/869334
[23:49:40] <wikibugs>	 (PS2) Ahmon Dancy: train-presync: Pass -Dfull_image_build:True to scap stage-train [puppet] - https://gerrit.wikimedia.org/r/869333 (https://phabricator.wikimedia.org/T325576)
[23:50:55] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.reboot
[23:51:57] <wikibugs>	 (CR) RLazarus: [C: +2] "Thanks!" [software/httpbb] - https://gerrit.wikimedia.org/r/865789 (owner: RLazarus)
[23:52:58] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Turnilo/ logstash for USER:eileen - https://phabricator.wikimedia.org/T325608 (Eileenmcnaughton)
[23:53:17] <wikibugs>	 SRE, All-and-every-Wikisource, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (Albertoleoncio) Including SRE as it involves Google Search Console
[23:53:37] <wikibugs>	 (Merged) jenkins-bot: Refactor: Migrate from attrs to dataclasses [software/httpbb] - https://gerrit.wikimedia.org/r/865789 (owner: RLazarus)
[23:54:00] <wikibugs>	 (CR) RLazarus: [C: +2] Refactor: Wrap verify_certs inside an Options type. [software/httpbb] - https://gerrit.wikimedia.org/r/865790 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)
[23:55:26] <wikibugs>	 (CR) Jeena Huneidi: [C: +1] train-presync: Pass -Dfull_image_build:True to scap stage-train [puppet] - https://gerrit.wikimedia.org/r/869333 (https://phabricator.wikimedia.org/T325576) (owner: Ahmon Dancy)
[23:55:31] <wikibugs>	 (Merged) jenkins-bot: Refactor: Wrap verify_certs inside an Options type. [software/httpbb] - https://gerrit.wikimedia.org/r/865790 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)
[23:59:49] <wikibugs>	 (CR) RLazarus: Add an option, off by default, to retry once when a request times out. (1 comment) [software/httpbb] - https://gerrit.wikimedia.org/r/865791 (https://phabricator.wikimedia.org/T323707) (owner: RLazarus)