[00:20:38] RECOVERY - Check systemd state on db2137 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:25:54] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [00:38:18] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 12.10 ms [00:39:10] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:40:04] PROBLEM - Check systemd state on dbstore1005 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@staging.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:43:18] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:55:55] (03CR) 10Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with : (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott) [00:56:48] (03PS14) 10Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with : [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 [01:07:00] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [01:07:44] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 
(expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [01:08:07] (03PS4) 10Andrew Bogott: OpenStack HAProxy: support frontend ferm rules into haproxy [puppet] - 10https://gerrit.wikimedia.org/r/845063 (https://phabricator.wikimedia.org/T319312) [01:08:09] (03PS5) 10Andrew Bogott: OpenStack nova: move the frontend firewall handling to haproxy code [puppet] - 10https://gerrit.wikimedia.org/r/845064 (https://phabricator.wikimedia.org/T319312) [01:08:30] (03CR) 10Andrew Bogott: OpenStack HAProxy: support frontend ferm rules into haproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845063 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [01:08:56] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [01:09:21] (03CR) 10CI reject: [V: 04-1] OpenStack HAProxy: support frontend ferm rules into haproxy [puppet] - 10https://gerrit.wikimedia.org/r/845063 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [01:09:34] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [01:10:43] (03CR) 10Andrew Bogott: [C: 03+2] openstack: encapi: support returning data in JSON [puppet] - 10https://gerrit.wikimedia.org/r/845869 (https://phabricator.wikimedia.org/T318503) (owner: 10Majavah) [01:11:32] (03CR) 10Andrew Bogott: [C: 03+2] openstack: wmcs-enc-cli: explicitely set accept/content-type as yaml [puppet] - 10https://gerrit.wikimedia.org/r/845868 (https://phabricator.wikimedia.org/T318503) (owner: 10Majavah) [01:12:20] (03CR) 10Andrew Bogott: [C: 03+2] openstack: encapi: reformat with black [puppet] - 
10https://gerrit.wikimedia.org/r/845623 (owner: 10Majavah) [01:14:15] (03PS5) 10Andrew Bogott: OpenStack HAProxy: support frontend ferm rules into haproxy [puppet] - 10https://gerrit.wikimedia.org/r/845063 (https://phabricator.wikimedia.org/T319312) [01:14:17] (03PS6) 10Andrew Bogott: OpenStack nova: move the frontend firewall handling to haproxy code [puppet] - 10https://gerrit.wikimedia.org/r/845064 (https://phabricator.wikimedia.org/T319312) [01:14:50] (03CR) 10CI reject: [V: 04-1] OpenStack HAProxy: support frontend ferm rules into haproxy [puppet] - 10https://gerrit.wikimedia.org/r/845063 (https://phabricator.wikimedia.org/T319312) (owner: 10Andrew Bogott) [01:16:49] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [01:17:58] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:28:22] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:45] (JobUnavailable) firing: (7) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:39:17] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [01:42:10] (03PS6) 10Andrew Bogott: OpenStack HAProxy: support frontend ferm rules into haproxy [puppet] - 10https://gerrit.wikimedia.org/r/845063 
(https://phabricator.wikimedia.org/T319312) [01:42:12] (03PS7) 10Andrew Bogott: OpenStack nova: move the frontend firewall handling to haproxy code [puppet] - 10https://gerrit.wikimedia.org/r/845064 (https://phabricator.wikimedia.org/T319312) [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:45] (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:15:58] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:17:02] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 2 (graphite1005, ...), Fresh: 122 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [02:49:02] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - 
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [03:09:02] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [03:10:06] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:12:16] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [03:18:06] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 124 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [03:24:49] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) [03:30:14] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST metrics) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:45:28] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [03:50:28] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - 
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [04:10:28] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [04:14:38] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [04:17:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [04:50:39] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10Marostegui) >>! In T320786#8335125, @jcrespo wrote: > ` > > @Ladsgroup do you want me to recover data to this host? Is this needed? From what I can read here the host didn't really crash... 
[05:06:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:11:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:19:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST metrics) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:21:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:21:34] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [05:26:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1016:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:30:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:35:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:40:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:45:12] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid [05:47:18] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [05:52:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [06:00:02] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10Marostegui) So this host has had mysql stopped for 6 days from what I can see. What's pending to be able to start it? [06:06:46] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [06:08:57] ACKNOWLEDGEMENT - MariaDB Replica IO: staging on dbstore1005 is CRITICAL: CRITICAL slave_io_state could not connect Marostegui https://phabricator.wikimedia.org/T321464 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:08:57] ACKNOWLEDGEMENT - MariaDB Replica Lag: staging on dbstore1005 is CRITICAL: CRITICAL slave_sql_lag could not connect Marostegui https://phabricator.wikimedia.org/T321464 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:08:57] ACKNOWLEDGEMENT - MariaDB Replica SQL: staging on dbstore1005 is CRITICAL: CRITICAL slave_sql_state could not connect Marostegui https://phabricator.wikimedia.org/T321464 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:08:57] ACKNOWLEDGEMENT - MariaDB read only staging on dbstore1005 is CRITICAL: Could not connect to localhost:3350 Marostegui https://phabricator.wikimedia.org/T321464 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [06:17:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1005:9193 is burning free allocators at a very high 
rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [06:18:36] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [06:18:38] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [06:22:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [06:26:50] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [06:26:52] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [06:29:28] PROBLEM - SSH on mw1337.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:30:34] RECOVERY - MariaDB Replica IO: staging on dbstore1005 is OK: OK slave_io_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:30:38] RECOVERY - MariaDB read only staging on dbstore1005 is OK: Version 10.4.22-MariaDB, Uptime 11s, read_only: False, 
event_scheduler: True, 11.68 QPS, connection latency: 0.004652s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [06:31:14] RECOVERY - MariaDB Replica Lag: staging on dbstore1005 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:32:14] RECOVERY - mysqld processes on dbstore1005 is OK: PROCS OK: 4 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [06:32:22] RECOVERY - MariaDB Replica SQL: staging on dbstore1005 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:00:05] Amir1 and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221024T0700). [07:00:05] matthiasmullie and Sohom_Datta: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:01:29] o/ [07:01:46] (03PS3) 10Matthias Mullie: Fix value for wgQuickViewMediaRepositorySearchUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844485 [07:01:58] o/ [07:02:04] mine is beta-only; I'll go ahead and merge that [07:02:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844485 (owner: 10Matthias Mullie) [07:03:04] (03Merged) 10jenkins-bot: Fix value for wgQuickViewMediaRepositorySearchUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/844485 (owner: 10Matthias Mullie) [07:03:31] I'm done [07:08:55] <_joe_> Sohom_Datta: I would help with your patches but I don't know enough about those things and there's no +1 on the patches [07:09:19] <_joe_> looks like neither urbanecm nor Amir1 are available this morning [07:09:31] I am [07:09:34] good morning [07:09:37] let me see [07:10:35] <_joe_> ehehe [07:10:46] <_joe_> the second one is a backport, I was confident enough to merge it [07:11:29] The config changes are a follow-up for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ProofreadPage/+/841998 [07:11:32] (03CR) 10Ladsgroup: [C: 03+2] Fix floating footer and wikieditor UI issue. [extensions/ProofreadPage] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/845751 (https://phabricator.wikimedia.org/T321344) (owner: 10Sohom Datta) [07:12:17] 10SRE, 10SRE-OnFire, 10Data-Persistence, 10User-notice, 10Wikimedia-Incident: s6 master failure - https://phabricator.wikimedia.org/T320990 (10Marostegui) Great work everyone! [07:12:53] Sohom_Datta and _joe_: I think it's good to go. Let's wait for the backport to merge and deploy [07:13:20] <_joe_> Amir1: don't you use "scap backport" to merge? [07:13:37] 10SRE, 10SRE-OnFire, 10Data-Persistence, 10User-notice, 10Wikimedia-Incident: s6 master failure - https://phabricator.wikimedia.org/T320990 (10Marostegui) Now that we have the incident report, can this be closed? 
[07:13:59] nah, specially for backport I don't want to wait in deploy1002 for twenty minutes [07:14:12] but after merge I use it to deploy [07:15:20] <_joe_> I have a couple patches of mine to merge too [07:15:25] <_joe_> once you've done [07:20:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:23:40] 10SRE, 10SRE-OnFire, 10Data-Persistence, 10User-notice, 10Wikimedia-Incident: s6 master failure - https://phabricator.wikimedia.org/T320990 (10Ladsgroup) 05Open→03Resolved Yup [07:25:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [07:26:28] (03Merged) 10jenkins-bot: Fix floating footer and wikieditor UI issue. [extensions/ProofreadPage] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/845751 (https://phabricator.wikimedia.org/T321344) (owner: 10Sohom Datta) [07:29:10] <_joe_> 15 minutes of CI for a js one-liner lol [07:29:21] <_joe_> none of those 15 minutes has any value for this patch [07:30:14] RECOVERY - SSH on mw1337.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:32:38] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:37:10] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/844556 (https://phabricator.wikimedia.org/T321241) (owner: 10Cwhite) [07:37:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/ProofreadPage] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/845751 (https://phabricator.wikimedia.org/T321344) (owner: 10Sohom Datta) [07:37:26] Sohom_Datta: can you test the backport patch? [07:37:41] I'll check [07:37:52] let me know once it's there [07:38:16] _joe_: yup :( [07:38:18] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:41:49] <_joe_> Amir1: lmk when I can proceed with my patches [07:42:12] sure, i'm waiting for the check in mwdebug [07:44:46] Yep, the changes are visible via mwdebug :) [07:45:40] awesome [07:46:31] (03CR) 10David Caro: [C: 03+1] "👍" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott) [07:49:45] can someone bring logmsgbot back please? https://wikitech.wikimedia.org/wiki/Logmsgbot are the docs. [07:50:09] urbanecm: let me try [07:50:15] thanks [07:52:41] hmm, I can't ssh to icinga1001 and the host status says it can't find it https://wikitech.wikimedia.org/wiki/Icinga1001 [07:53:36] aha, it's replaced [07:53:36] The config changes show up as a merge conflict on gerrit, is that an issue? 
[07:53:46] # new alert (icinga + alertmanager) systems, replacing icinga[12]001 (T255072, T255070) [07:53:47] T255072: (Due By: 2020-07-25) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 [07:53:47] T255070: (Need By:TBD) rack/setup/install alert2001 - https://phabricator.wikimedia.org/T255070 [07:54:04] Sohom_Datta: it shouldn't [07:54:12] (03PS5) 10Ladsgroup: Enable source links on Translation ns on enwikisource and thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843593 (https://phabricator.wikimedia.org/T53980) (owner: 10Sohom Datta) [07:54:18] Oh, okay cool cool :) [07:54:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843593 (https://phabricator.wikimedia.org/T53980) (owner: 10Sohom Datta) [07:55:32] (03Merged) 10jenkins-bot: Enable source links on Translation ns on enwikisource and thwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/843593 (https://phabricator.wikimedia.org/T53980) (owner: 10Sohom Datta) [07:56:26] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10jcrespo) > Is this needed? From what I can read here the host didn't really crash or anything, right? It did- the controller locked itself [presumably] while running megacli commands to de... [07:57:10] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10Marostegui) Right - should we start mysql, let it catch up and run a data check? 
(I can take care of that) [07:57:55] Sohom_Datta: live in mwdebug [07:58:35] urbanecm: restarted, that's the most I can do :D [07:58:43] seems it's back :) [07:58:51] thank you Amir1 [07:59:31] I could have done it sooner if the docs were not outdated, let's save time from the future [07:59:45] Yep, can see it on en.wikisource and thwikisource :) [07:59:49] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10jcrespo) FYI, I offered to do a data reload at T320786#8335125 but @Ladsgroup asked to recover it from backups himself for learning purposes; I handover any decision to the 2 of you. [08:00:04] !log Starting october reboots of lingering wikikube eqiad hosts [08:00:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:53] !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=kubernetes,service=kubesvc,name=kubernetes1006.eqiad.wmnet [08:04:00] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 32787 [08:05:04] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:843593|Enable source links on Translation ns on enwikisource and thwikisource (T53980)]] (duration: 09m 18s) [08:05:10] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 32787 [08:05:10] T53980: Source tab not showing up in the Translation namespace - https://phabricator.wikimedia.org/T53980 [08:07:12] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 132337 [08:07:15] Amir1: Thanks a bunch for the deploys :) [08:07:44] :) [08:07:53] _joe_: I'm done [08:08:04] <_joe_> Amir1: and now I have to go afk [08:08:10] :D [08:08:26] it's the cluster config? 
[08:08:47] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 132337 [08:09:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P35982 and previous config saved to /var/cache/conftool/dbconfig/20221024-080942-ladsgroup.json [08:09:49] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 8075 [08:10:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P35983 and previous config saved to /var/cache/conftool/dbconfig/20221024-081033-ladsgroup.json [08:12:25] 10SRE, 10ops-eqiad, 10DBA, 10Patch-For-Review: Degraded RAID on db1202 - https://phabricator.wikimedia.org/T320786 (10Marostegui) We've agreed to start mysql now as he'll use test hosts for recovery testing. [08:12:35] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 8075 [08:14:38] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [08:16:43] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 13335 [08:19:07] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 13335 [08:19:18] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes1006.eqiad.wmnet [08:20:36] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:20:52] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - 
AS64601/IPv4: Connect - kubernetes-eqiad, AS64601/IPv6: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:22:57] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 3303 [08:24:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P35984 and previous config saved to /var/cache/conftool/dbconfig/20221024-082448-ladsgroup.json [08:25:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T321312)', diff saved to https://phabricator.wikimedia.org/P35985 and previous config saved to /var/cache/conftool/dbconfig/20221024-082540-ladsgroup.json [08:25:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [08:25:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [08:26:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T321312)', diff saved to https://phabricator.wikimedia.org/P35986 and previous config saved to /var/cache/conftool/dbconfig/20221024-082605-ladsgroup.json [08:26:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 3303 [08:26:14] (03CR) 10Vgutierrez: [C: 03+1] "LGTM as a temporary measure but see the feedback on the task" [puppet] - 10https://gerrit.wikimedia.org/r/844556 (https://phabricator.wikimedia.org/T321241) (owner: 10Cwhite) [08:26:19] (03PS1) 10David Caro: gitlab_runner: add toolforge ci images to allowed list [puppet] - 10https://gerrit.wikimedia.org/r/848186 [08:26:59] !log kubernetes1006:~$ sudo systemctl reset-failed ifup@ens13.service T273026 [08:27:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:15] T273026: Errors for ifup@ens5.service after 
rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [08:28:07] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1006.eqiad.wmnet [08:28:20] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10Vgutierrez) 05In progress→03Resolved a:03Vgutierrez [08:28:42] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=kubernetes,service=kubesvc,name=kubernetes1006.eqiad.wmnet [08:29:30] !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=kubernetes,service=kubesvc,name=kubernetes1015.eqiad.wmnet [08:30:51] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes1015.eqiad.wmnet [08:31:51] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Ladsgroup) >>! In T211661#8139653, @Joe wrote: > I would love to see some numbers on how many thumbnails get a response from swif... 
[08:32:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T321312)', diff saved to https://phabricator.wikimedia.org/P35987 and previous config saved to /var/cache/conftool/dbconfig/20221024-083242-ladsgroup.json [08:33:04] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:33:20] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:35:27] !log set thanos ring replicas to 3.50 T311690 [08:35:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:32] T311690: Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 [08:36:21] (03PS3) 10Jbond: prometheus: move service_catalog_targets under ::targets [puppet] - 10https://gerrit.wikimedia.org/r/845528 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [08:36:23] (03PS1) 10Jbond: puppetmaster: correct IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/848188 [08:36:44] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1045.eqiad.wmnet [08:37:02] !log kubernetes1015:~$ sudo systemctl reset-failed ifup@ens13.service T273026 [08:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:11] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [08:37:15] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1015.eqiad.wmnet [08:37:23] (03CR) 10CI reject: [V: 04-1] puppetmaster: correct IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/848188 (owner: 10Jbond) [08:37:32] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 
10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Ladsgroup) Another way that might be much much easier. Delete old thumbnails in small portion of swift and check how much request... [08:37:51] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2049.codfw.wmnet [08:38:04] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=kubernetes,service=kubesvc,name=kubernetes1015.eqiad.wmnet [08:38:33] !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=kubernetes,service=kubesvc,name=kubernetes1016.eqiad.wmnet [08:39:13] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 31042 [08:39:20] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:39:31] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes1016.eqiad.wmnet [08:39:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T321312)', diff saved to https://phabricator.wikimedia.org/P35988 and previous config saved to /var/cache/conftool/dbconfig/20221024-083955-ladsgroup.json [08:39:58] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 31042 [08:40:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [08:40:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance [08:40:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:40:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 
days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:40:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T321312)', diff saved to https://phabricator.wikimedia.org/P35989 and previous config saved to /var/cache/conftool/dbconfig/20221024-084037-ladsgroup.json [08:41:26] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:41:40] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:42:50] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 199524 [08:42:57] (03PS2) 10Jbond: puppetmaster: correct IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/848188 [08:44:00] (03CR) 10Jelto: [C: 04-1] "This should be added to hieradata/cloud.yaml instead of hieradata/role/common/gitlab_runner.yaml. 
cloud.yaml specifies what's allowed for " [puppet] - 10https://gerrit.wikimedia.org/r/848186 (owner: 10David Caro) [08:44:16] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1045.eqiad.wmnet [08:44:17] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 199524 [08:44:31] (03CR) 10Jbond: [C: 03+2] puppetmaster: correct IPv6 address [puppet] - 10https://gerrit.wikimedia.org/r/848188 (owner: 10Jbond) [08:44:54] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes1016.eqiad.wmnet [08:45:25] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=kubernetes,service=kubesvc,name=kubernetes1016.eqiad.wmnet [08:47:09] (03PS6) 10Filippo Giunchedi: dispatch: add backend role [puppet] - 10https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229) [08:47:11] (03PS6) 10Filippo Giunchedi: WIP: add profile::dispatch [puppet] - 10https://gerrit.wikimedia.org/r/824449 (https://phabricator.wikimedia.org/T313229) [08:47:13] (03PS1) 10Filippo Giunchedi: dispatch: assign backend role [puppet] - 10https://gerrit.wikimedia.org/r/848191 (https://phabricator.wikimedia.org/T313229) [08:47:20] (03CR) 10Jbond: [C: 03+1] "See the previous patch, the case of IPv6 addresses was random, so a 50/50 chance of passing CI. 
I have now forced lowercase" [puppet] - 10https://gerrit.wikimedia.org/r/845528 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [08:47:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P35990 and previous config saved to /var/cache/conftool/dbconfig/20221024-084748-ladsgroup.json [08:48:31] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1046.eqiad.wmnet [08:50:58] (03CR) 10Jelto: [C: 04-1] gitlab_runner: add toolforge ci images to allowed list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/848186 (owner: 10David Caro) [08:51:19] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2049.codfw.wmnet [08:52:10] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37681/console" [puppet] - 10https://gerrit.wikimedia.org/r/848191 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [08:52:42] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2050.codfw.wmnet [08:52:54] !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=kubernetes,service=kubemaster,name=kubemaster1001.eqiad.wmnet [08:53:07] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster1001.eqiad.wmnet [08:53:36] (03CR) 10Filippo Giunchedi: [C: 03+2] dispatch: add backend role [puppet] - 10https://gerrit.wikimedia.org/r/824448 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [08:53:58] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:54:10] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - 
kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:54:44] XioNoX: Can that ^ be because of the reboots I'm doing on wikikube? [08:54:53] claime: yes [08:55:03] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: move service_catalog_targets under ::targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845528 (https://phabricator.wikimedia.org/T310266) (owner: 10Filippo Giunchedi) [08:55:18] Did I screw something up, or should I just wait a bit for it to self-resolve? [08:55:28] claime: it should self-resolve [08:55:31] ack [08:56:15] unfortunately we can't downtime this specifically, only all BGP status errors from the routers [08:56:16] (03PS1) 10Ladsgroup: Enable LBFactory config callback in CLI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848201 (https://phabricator.wikimedia.org/T298485) [08:56:52] (03CR) 10Volans: "Post-merge comment" [cookbooks] - 10https://gerrit.wikimedia.org/r/845086 (owner: 10Ryan Kemper) [08:57:11] jayme: I see. 
Thanks [08:58:24] yep exactly [08:59:11] Sorry for the noise then :) [08:59:54] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster1001.eqiad.wmnet [09:00:11] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=kubernetes,service=kubemaster,name=kubemaster1001.eqiad.wmnet [09:00:11] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2050.codfw.wmnet [09:00:13] (03PS2) 10David Caro: gitlab_runner: add toolforge ci images to allowed list [puppet] - 10https://gerrit.wikimedia.org/r/848186 [09:00:15] (03CR) 10David Caro: gitlab_runner: add toolforge ci images to allowed list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/848186 (owner: 10David Caro) [09:01:10] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2051.codfw.wmnet [09:01:45] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1046.eqiad.wmnet [09:01:47] !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,cluster=kubernetes,service=kubemaster,name=kubemaster1002.eqiad.wmnet [09:01:53] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster1002.eqiad.wmnet [09:02:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P35991 and previous config saved to /var/cache/conftool/dbconfig/20221024-090255-ladsgroup.json [09:03:10] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1047.eqiad.wmnet [09:04:22] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv4: Active - kubernetes-eqiad, AS64601/IPv6: Active - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:04:34] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64601/IPv6: Active - kubernetes-eqiad, AS64601/IPv4: Active 
- kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:07:05] (03PS1) 10Jbond: P:mail::mx: move passwords to hiera [puppet] - 10https://gerrit.wikimedia.org/r/848214 (https://phabricator.wikimedia.org/T303272) [09:08:49] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster1002.eqiad.wmnet [09:08:59] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=kubernetes,service=kubemaster,name=kubemaster1002.eqiad.wmnet [09:09:51] !log Starting october reboots of lingering wikikube codfw hosts [09:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:56] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1047.eqiad.wmnet [09:11:10] (03CR) 10Jelto: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/848186 (owner: 10David Caro) [09:11:13] !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=kubernetes,service=kubesvc,name=kubernetes2015.codfw.wmnet [09:12:53] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes2015.codfw.wmnet [09:14:13] (03PS2) 10Jbond: P:mail::mx: move passwords to hiera [puppet] - 10https://gerrit.wikimedia.org/r/848214 (https://phabricator.wikimedia.org/T303272) [09:15:08] <_joe_> jouncebot: now and next [09:15:08] No deployments scheduled for the next 3 hour(s) and 44 minute(s) [09:15:17] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be2051.codfw.wmnet [09:15:27] <_joe_> ok then I can go on with my stuff [09:16:17] (03PS3) 10Giuseppe Lavagetto: Stop assigning the PHP_ENGINE cookie [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839499 (https://phabricator.wikimedia.org/T271736) [09:17:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (3) Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [09:17:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839499 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [09:17:22] !log kubernetes2015:~$ sudo systemctl reset-failed ifup@ens13.service T273026 [09:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:29] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026 [09:18:02] (03Merged) 10jenkins-bot: Stop assigning the PHP_ENGINE cookie [mediawiki-config] - 10https://gerrit.wikimedia.org/r/839499 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [09:18:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T321312)', diff saved to https://phabricator.wikimedia.org/P35993 and previous config saved to /var/cache/conftool/dbconfig/20221024-091801-ladsgroup.json [09:18:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [09:18:14] !log oblivian@deploy1002 Started scap: Backport for [[gerrit:839499|Stop assigning the PHP_ENGINE cookie (T271736)]] [09:18:19] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2015.codfw.wmnet [09:18:20] T271736: Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 [09:18:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [09:18:34] !log oblivian@deploy1002 oblivian and oblivian: Backport for 
[[gerrit:839499|Stop assigning the PHP_ENGINE cookie (T271736)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [09:18:55] (03PS3) 10Jbond: P:mail::mx: move passwords to hiera [puppet] - 10https://gerrit.wikimedia.org/r/848214 (https://phabricator.wikimedia.org/T303272) [09:19:11] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=kubernetes,service=kubesvc,name=kubernetes2015.codfw.wmnet [09:19:42] !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=kubernetes,service=kubesvc,name=kubernetes2016.codfw.wmnet [09:20:06] (03PS4) 10David Caro: wmcs.create_instance_with_prefix: Add a sec group default [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841089 [09:20:08] (03PS1) 10David Caro: create_instance_with_prefix: fix prefix guess [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/848222 [09:20:14] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST metrics) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:20:25] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubernetes2016.codfw.wmnet [09:21:18] PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:21:29] (03PS4) 10Jbond: P:mail::mx: move passwords to hiera [puppet] - 10https://gerrit.wikimedia.org/r/848214 (https://phabricator.wikimedia.org/T303272) [09:21:34] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:22:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [09:23:04] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [09:23:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T321312)', diff saved to https://phabricator.wikimedia.org/P35994 and previous config saved to /var/cache/conftool/dbconfig/20221024-092310-ladsgroup.json [09:23:13] !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:839499|Stop assigning the PHP_ENGINE cookie (T271736)]] (duration: 04m 59s) [09:23:19] T271736: Migrate WMF production from PHP 7.2 to PHP 7.4 - https://phabricator.wikimedia.org/T271736 [09:23:22] RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:23:33] (03CR) 10CI reject: [V: 04-1] create_instance_with_prefix: fix prefix guess [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/848222 (owner: 10David Caro) [09:25:51] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubernetes2016.codfw.wmnet [09:26:15] (03PS5) 10Jbond: P:mail::mx: move passwords to hiera [puppet] - 10https://gerrit.wikimedia.org/r/848214 (https://phabricator.wikimedia.org/T303272) [09:26:18] 10SRE-tools, 10Infrastructure-Foundations: Netbox accounting report: exclude removed hosts - https://phabricator.wikimedia.org/T320955 (10Volans) I agree with @ayounsi that adding a status just for this seems a bit too big of a hammer. Also there is no way to make those devices "grayed" and they will pollute a... 
[09:26:58] (03PS1) 10Filippo Giunchedi: dispatch: update to latest upstream [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/848228 (https://phabricator.wikimedia.org/T313229) [09:27:06] !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=kubernetes,service=kubemaster,name=kubemaster2001.codfw.wmnet [09:27:34] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster2001.codfw.wmnet [09:29:11] (03PS6) 10Jbond: P:mail::mx: move passwords to hiera [puppet] - 10https://gerrit.wikimedia.org/r/848214 (https://phabricator.wikimedia.org/T303272) [09:29:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T321312)', diff saved to https://phabricator.wikimedia.org/P35995 and previous config saved to /var/cache/conftool/dbconfig/20221024-092933-ladsgroup.json [09:30:00] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37688/console" [puppet] - 10https://gerrit.wikimedia.org/r/848214 (https://phabricator.wikimedia.org/T303272) (owner: 10Jbond) [09:32:01] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:mail::mx: move passwords to hiera [puppet] - 10https://gerrit.wikimedia.org/r/848214 (https://phabricator.wikimedia.org/T303272) (owner: 10Jbond) [09:34:25] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster2001.codfw.wmnet [09:34:44] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=kubernetes,service=kubemaster,name=kubemaster2001.codfw.wmnet [09:35:12] !log cgoubert@cumin1001 conftool action : set/pooled=no; selector: dc=codfw,cluster=kubernetes,service=kubemaster,name=kubemaster2002.codfw.wmnet [09:35:44] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host kubemaster2002.codfw.wmnet [09:36:26] (03PS1) 10Jbond: Revert "P:mail::mx: move passwords to hiera" [puppet] - 
10https://gerrit.wikimedia.org/r/845760 [09:36:34] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:36:42] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "P:mail::mx: move passwords to hiera" [puppet] - 10https://gerrit.wikimedia.org/r/845760 (owner: 10Jbond) [09:37:13] (03PS1) 10Jbond: P:mail::mx: move passwords to hiera [puppet] - 10https://gerrit.wikimedia.org/r/845761 (https://phabricator.wikimedia.org/T303272) [09:38:41] (03PS1) 10Giuseppe Lavagetto: httpd-fcgi: further improvements for logging. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/848241 [09:39:27] (03CR) 10Btullis: analytics: move kerberos::systemd_timer and deps to send_mail param (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [09:40:32] (03PS1) 10AikoChou: ml-services: update outlink Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/848245 (https://phabricator.wikimedia.org/T315994) [09:40:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T321312)', diff saved to https://phabricator.wikimedia.org/P35996 and previous config saved to /var/cache/conftool/dbconfig/20221024-094052-ladsgroup.json [09:41:00] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host kubemaster2002.codfw.wmnet [09:41:25] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2051.codfw.wmnet [09:41:41] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=codfw,cluster=kubernetes,service=kubemaster,name=kubemaster2002.codfw.wmnet [09:43:14] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1048.eqiad.wmnet [09:43:38] (03PS2) 10Jbond: P:mail::mx: move passwords to hiera [puppet] - 
10https://gerrit.wikimedia.org/r/845761 (https://phabricator.wikimedia.org/T303272) [09:44:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37689/console" [puppet] - 10https://gerrit.wikimedia.org/r/845761 (https://phabricator.wikimedia.org/T303272) (owner: 10Jbond) [09:44:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P35997 and previous config saved to /var/cache/conftool/dbconfig/20221024-094440-ladsgroup.json [09:46:13] (03PS9) 10Cparle: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) [09:46:45] (03CR) 10Filippo Giunchedi: [C: 03+2] analytics: move kerberos::systemd_timer and deps to send_mail param [puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [09:46:54] (03PS5) 10Filippo Giunchedi: analytics: move kerberos::systemd_timer and deps to send_mail param [puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) [09:47:20] (03PS10) 10Cparle: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) [09:48:31] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:mail::mx: move passwords to hiera [puppet] - 10https://gerrit.wikimedia.org/r/845761 (https://phabricator.wikimedia.org/T303272) (owner: 10Jbond) [09:49:42] (03CR) 10Filippo Giunchedi: "Thanks Ben! Merging, let me know of any fallout!" 
[puppet] - 10https://gerrit.wikimedia.org/r/843885 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [09:50:13] (03CR) 10Cparle: Alerts for image suggestions pipeline (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [09:53:39] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2051.codfw.wmnet [09:53:45] (03CR) 10Klausman: [C: 03+2] ml-services: update outlink Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/848245 (https://phabricator.wikimedia.org/T315994) (owner: 10AikoChou) [09:54:38] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [09:56:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P35998 and previous config saved to /var/cache/conftool/dbconfig/20221024-095559-ladsgroup.json [09:57:38] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2052.codfw.wmnet [09:58:34] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1048.eqiad.wmnet [09:59:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P35999 and previous config saved to /var/cache/conftool/dbconfig/20221024-095946-ladsgroup.json [09:59:58] (03Merged) 10jenkins-bot: ml-services: update outlink Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/848245 (https://phabricator.wikimedia.org/T315994) (owner: 10AikoChou) [10:00:28] (03CR) 10Filippo Giunchedi: "LGTM, see inline for a comment re: alert expression and a bunch of nits" [alerts] - 
10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [10:01:59] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1049.eqiad.wmnet [10:02:25] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/848241 (owner: 10Giuseppe Lavagetto) [10:03:54] 10SRE, 10Znuny, 10serviceops-collab, 10Patch-For-Review: Move VTRS db passwords to a different hiera location - https://phabricator.wikimedia.org/T303272 (10jbond) 05Open→03Resolved a:03jbond >>! In T303272#8336257, @Dzahn wrote: > The password at `modules/passwords/manifests/init.pp: $vrts_mysql_pas... [10:04:32] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,cluster=kubernetes-staging,service=kubemaster [10:06:02] (03PS3) 10Jbond: P:puppetdb: add documentation and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/842854 [10:06:56] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [10:07:43] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[10:07:56] !log upload wmf-beamer-style 0.3 to apt [10:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P36000 and previous config saved to /var/cache/conftool/dbconfig/20221024-101105-ladsgroup.json [10:14:17] (03PS10) 10Jbond: P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461 [10:14:22] (03CR) 10Jbond: [C: 03+2] P:puppetdb: add documentation and fix minor lint issues [puppet] - 10https://gerrit.wikimedia.org/r/842854 (owner: 10Jbond) [10:14:38] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [10:14:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T321312)', diff saved to https://phabricator.wikimedia.org/P36001 and previous config saved to /var/cache/conftool/dbconfig/20221024-101453-ladsgroup.json [10:14:59] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST metrics) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:14:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [10:15:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [10:15:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T321312)', diff saved to https://phabricator.wikimedia.org/P36002 and previous config saved to /var/cache/conftool/dbconfig/20221024-101518-ladsgroup.json [10:15:26] !log mvernon@cumin1001 END 
(PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2052.codfw.wmnet [10:15:59] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2053.codfw.wmnet [10:16:49] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1049.eqiad.wmnet [10:17:26] (03CR) 10Jbond: [C: 03+2] P:cumin::master: Add aliases for lvs traffic classes [puppet] - 10https://gerrit.wikimedia.org/r/844461 (owner: 10Jbond) [10:18:01] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1050.eqiad.wmnet [10:22:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T321312)', diff saved to https://phabricator.wikimedia.org/P36003 and previous config saved to /var/cache/conftool/dbconfig/20221024-102237-ladsgroup.json [10:23:00] (03PS11) 10Cparle: Alerts for image suggestions pipeline [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) [10:23:17] (03CR) 10Cparle: Alerts for image suggestions pipeline (034 comments) [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [10:25:02] (03PS7) 10Hnowlan: helmfile.d: add thumbor configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196) [10:25:31] (03PS1) 10Jbond: P:puppetdb: fix ip address [puppet] - 10https://gerrit.wikimedia.org/r/848256 [10:25:45] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:puppetdb: fix ip address [puppet] - 10https://gerrit.wikimedia.org/r/848256 (owner: 10Jbond) [10:26:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T321312)', diff saved to https://phabricator.wikimedia.org/P36004 and previous config saved to /var/cache/conftool/dbconfig/20221024-102612-ladsgroup.json [10:26:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1107.eqiad.wmnet with reason: Maintenance 
[10:26:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1107.eqiad.wmnet with reason: Maintenance [10:26:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1107 (T321312)', diff saved to https://phabricator.wikimedia.org/P36005 and previous config saved to /var/cache/conftool/dbconfig/20221024-102636-ladsgroup.json [10:26:57] (03CR) 10Btullis: [C: 03+2] Update the email address for data-engineering alerts [puppet] - 10https://gerrit.wikimedia.org/r/845030 (https://phabricator.wikimedia.org/T315486) (owner: 10Btullis) [10:28:33] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! Nicely done! (you can self-merge at will, alerts will be deployed at the next puppet run on prometheus hosts)" [alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [10:29:34] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1050.eqiad.wmnet [10:30:13] (03CR) 10Cparle: "Thanks Filippo! Just need to get the airflow job that pushes to prometheus merged first ..." 
[alerts] - 10https://gerrit.wikimedia.org/r/843996 (https://phabricator.wikimedia.org/T312235) (owner: 10Cparle) [10:30:41] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2053.codfw.wmnet [10:32:31] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1051.eqiad.wmnet [10:32:57] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2054.codfw.wmnet [10:33:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T321312)', diff saved to https://phabricator.wikimedia.org/P36006 and previous config saved to /var/cache/conftool/dbconfig/20221024-103305-ladsgroup.json [10:35:15] (03CR) 10Nikerabbit: [C: 03+1] Enable Section Translation in Hawaiian, Pashto and Xhosa WPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845573 (https://phabricator.wikimedia.org/T317289) (owner: 10KartikMistry) [10:37:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P36007 and previous config saved to /var/cache/conftool/dbconfig/20221024-103743-ladsgroup.json [10:39:09] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1051.eqiad.wmnet [10:42:02] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2054.codfw.wmnet [10:42:08] (03CR) 10Hnowlan: New organization of templates (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/837495 (owner: 10Giuseppe Lavagetto) [10:42:31] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1052.eqiad.wmnet [10:43:04] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2055.codfw.wmnet [10:44:43] (03PS1) 10Btullis: Update remaining references to analytics-alerts email [puppet] - 10https://gerrit.wikimedia.org/r/848262 (https://phabricator.wikimedia.org/T315486) [10:46:21] (03CR) 
10Btullis: [C: 03+2] Update remaining references to analytics-alerts email [puppet] - 10https://gerrit.wikimedia.org/r/848262 (https://phabricator.wikimedia.org/T315486) (owner: 10Btullis) [10:46:23] (03PS2) 10Giuseppe Lavagetto: httpd-fcgi: further improvements for logging. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/848241 (https://phabricator.wikimedia.org/T301757) [10:46:33] (03PS1) 10Giuseppe Lavagetto: shellbox: add new env variables from the httpd-fcgi image [deployment-charts] - 10https://gerrit.wikimedia.org/r/848263 (https://phabricator.wikimedia.org/T301757) [10:46:35] (03PS1) 10Giuseppe Lavagetto: shellbox: switch to ecs logging, skip system logs [deployment-charts] - 10https://gerrit.wikimedia.org/r/848264 (https://phabricator.wikimedia.org/T301757) [10:47:02] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1052.eqiad.wmnet [10:47:58] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1053.eqiad.wmnet [10:48:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P36008 and previous config saved to /var/cache/conftool/dbconfig/20221024-104812-ladsgroup.json [10:49:06] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:52:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P36009 and previous config saved to /var/cache/conftool/dbconfig/20221024-105250-ladsgroup.json [10:54:18] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1053.eqiad.wmnet [10:54:31] (03CR) 10Clément Goubert: [C: 03+1] httpd-fcgi: further improvements for logging. 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/848241 (https://phabricator.wikimedia.org/T301757) (owner: 10Giuseppe Lavagetto) [11:01:04] 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10Volans) One way to achieve the above could be to store the disks in Netbox. They could be stored as... [11:03:16] PROBLEM - Host ms-be2055 is DOWN: PING CRITICAL - Packet loss = 100% [11:03:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P36010 and previous config saved to /var/cache/conftool/dbconfig/20221024-110318-ladsgroup.json [11:03:38] RECOVERY - Host ms-be2055 is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms [11:05:22] PROBLEM - SSH on ms-be2055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:07:10] jouncebot: nowandnext [11:07:10] No deployments scheduled for the next 1 hour(s) and 52 minute(s) [11:07:10] In 1 hour(s) and 52 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221024T1300) [11:07:15] awesome [11:07:27] (03CR) 10Ladsgroup: [C: 03+2] Enable LBFactory config callback in CLI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848201 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [11:07:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T321312)', diff saved to https://phabricator.wikimedia.org/P36011 and previous config saved to /var/cache/conftool/dbconfig/20221024-110756-ladsgroup.json [11:08:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [11:08:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap 
backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848201 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [11:08:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2153.codfw.wmnet with reason: Maintenance [11:08:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T321312)', diff saved to https://phabricator.wikimedia.org/P36012 and previous config saved to /var/cache/conftool/dbconfig/20221024-110822-ladsgroup.json [11:08:55] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (3) Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:09:01] (03Merged) 10jenkins-bot: Enable LBFactory config callback in CLI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848201 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [11:09:13] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:848201|Enable LBFactory config callback in CLI (T298485)]] [11:09:18] T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485 [11:09:22] RECOVERY - SSH on ms-be2055 is OK: SSH OK - OpenSSH_8.4p1 Debian-5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [11:09:33] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:848201|Enable LBFactory config callback in CLI (T298485)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [11:11:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1189', diff saved to https://phabricator.wikimedia.org/P36013 and previous config saved to 
/var/cache/conftool/dbconfig/20221024-111121-ladsgroup.json [11:13:18] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:51] (03PS4) 10Stang: logos: Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/847234 (https://phabricator.wikimedia.org/T319223) [11:14:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T321312)', diff saved to https://phabricator.wikimedia.org/P36014 and previous config saved to /var/cache/conftool/dbconfig/20221024-111439-ladsgroup.json [11:15:00] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:16:10] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:35] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2055.codfw.wmnet [11:17:06] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 9.159 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:17:49] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:848201|Enable LBFactory config callback in CLI (T298485)]] (duration: 08m 35s) [11:17:54] T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485 [11:18:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repool db1189', diff saved to https://phabricator.wikimedia.org/P36015 and previous config saved to /var/cache/conftool/dbconfig/20221024-111813-ladsgroup.json [11:18:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T321312)', diff saved to https://phabricator.wikimedia.org/P36016 and previous config saved to 
/var/cache/conftool/dbconfig/20221024-111825-ladsgroup.json [11:18:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1118.eqiad.wmnet with reason: Maintenance [11:18:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1118.eqiad.wmnet with reason: Maintenance [11:18:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T321312)', diff saved to https://phabricator.wikimedia.org/P36017 and previous config saved to /var/cache/conftool/dbconfig/20221024-111849-ladsgroup.json [11:25:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T321312)', diff saved to https://phabricator.wikimedia.org/P36018 and previous config saved to /var/cache/conftool/dbconfig/20221024-112515-ladsgroup.json [11:28:50] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (3) Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:29:12] 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10jbond) > 'Identifiers': [{'DurableName': '500056b3d120b9c5', 'DurableNameFormat': 'NAA'}], Linux mo... 
[11:29:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P36019 and previous config saved to /var/cache/conftool/dbconfig/20221024-112946-ladsgroup.json [11:30:16] (03CR) 10JMeybohm: coredns: upgrade to 1.8.7 (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844499 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey) [11:31:14] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:33:49] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (3) Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:38:49] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (4) Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:39:55] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (4) Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:40:22] 10SRE-swift-storage, 10Infrastructure-Foundations: unstable device mapping of SSDs causing installer problems - example reimage with destruction of 
swift filesystem - https://phabricator.wikimedia.org/T308677 (10jbond) if we can get the pci address for the raid card from redfish we may also be able top use `/d... [11:40:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P36020 and previous config saved to /var/cache/conftool/dbconfig/20221024-114022-ladsgroup.json [11:43:49] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (4) Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:44:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P36021 and previous config saved to /var/cache/conftool/dbconfig/20221024-114452-ladsgroup.json [11:51:46] (03PS5) 10Jbond: sre.dns.netbox: add call to sre.puppet.sync-netbox-hiera [cookbooks] - 10https://gerrit.wikimedia.org/r/804575 [11:53:49] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (4) Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:54:07] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1054.eqiad.wmnet [11:54:13] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2056.codfw.wmnet [11:55:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P36022 and previous config saved to 
/var/cache/conftool/dbconfig/20221024-115528-ladsgroup.json [11:59:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T321312)', diff saved to https://phabricator.wikimedia.org/P36023 and previous config saved to /var/cache/conftool/dbconfig/20221024-115959-ladsgroup.json [12:00:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [12:00:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [12:00:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T321312)', diff saved to https://phabricator.wikimedia.org/P36024 and previous config saved to /var/cache/conftool/dbconfig/20221024-120026-ladsgroup.json [12:01:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T321312)', diff saved to https://phabricator.wikimedia.org/P36025 and previous config saved to /var/cache/conftool/dbconfig/20221024-120153-ladsgroup.json [12:08:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:09:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T321312)', diff saved to https://phabricator.wikimedia.org/P36027 and previous config saved to /var/cache/conftool/dbconfig/20221024-120900-ladsgroup.json [12:09:29] (03PS1) 10AikoChou: ml-services: add EventGate settings for outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/848291 (https://phabricator.wikimedia.org/T315994) [12:10:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T321312)', diff saved to 
https://phabricator.wikimedia.org/P36028 and previous config saved to /var/cache/conftool/dbconfig/20221024-121034-ladsgroup.json [12:10:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance [12:10:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1119.eqiad.wmnet with reason: Maintenance [12:10:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T321312)', diff saved to https://phabricator.wikimedia.org/P36029 and previous config saved to /var/cache/conftool/dbconfig/20221024-121058-ladsgroup.json [12:13:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:13:34] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2056.codfw.wmnet [12:13:47] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be1054.eqiad.wmnet [12:13:57] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1054.eqiad.wmnet [12:14:31] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2057.codfw.wmnet [12:15:23] !log restarting blazegraph on wdqs1005, wdqs1006, wdqs1012 and wdqs1016 (BlazegraphFreeAllocatorsDecreasingRapidly) [12:15:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T321312)', diff saved to https://phabricator.wikimedia.org/P36030 and previous config saved to /var/cache/conftool/dbconfig/20221024-121829-ladsgroup.json [12:19:02] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL 
- Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:19:55] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (4) Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [12:21:26] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:22:56] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1054.eqiad.wmnet [12:23:49] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (4) Blazegraph instance wdqs1005:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [12:23:59] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1055.eqiad.wmnet [12:24:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P36031 and previous config saved to /var/cache/conftool/dbconfig/20221024-122407-ladsgroup.json [12:24:08] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2057.codfw.wmnet [12:25:13] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2058.codfw.wmnet [12:27:22] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.543 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:27:38] RECOVERY - mailman archives on 
lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 1.317 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:28:00] (03CR) 10Klausman: [C: 03+2] ml-services: add EventGate settings for outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/848291 (https://phabricator.wikimedia.org/T315994) (owner: 10AikoChou) [12:31:15] (03Merged) 10jenkins-bot: ml-services: add EventGate settings for outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/848291 (https://phabricator.wikimedia.org/T315994) (owner: 10AikoChou) [12:33:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P36032 and previous config saved to /var/cache/conftool/dbconfig/20221024-123336-ladsgroup.json [12:34:04] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2058.codfw.wmnet [12:34:42] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2059.codfw.wmnet [12:34:55] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1055.eqiad.wmnet [12:35:21] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1056.eqiad.wmnet [12:35:41] (03PS1) 10Jbond: C:swift::storage: drop unused udev rule [puppet] - 10https://gerrit.wikimedia.org/r/848302 (https://phabricator.wikimedia.org/T163673) [12:36:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37690/console" [puppet] - 10https://gerrit.wikimedia.org/r/848302 (https://phabricator.wikimedia.org/T163673) (owner: 10Jbond) [12:39:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P36033 and previous config saved to /var/cache/conftool/dbconfig/20221024-123913-ladsgroup.json [12:45:55] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 
2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37691/console" [puppet] - 10https://gerrit.wikimedia.org/r/848302 (https://phabricator.wikimedia.org/T163673) (owner: 10Jbond) [12:47:48] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1056.eqiad.wmnet [12:48:23] !log jbond@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync data - jbond@cumin1001" [12:48:29] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2059.codfw.wmnet [12:48:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P36034 and previous config saved to /var/cache/conftool/dbconfig/20221024-124842-ladsgroup.json [12:49:40] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync data - jbond@cumin1001" [12:50:26] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:50:46] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:50:48] (03CR) 10Herron: [C: 03+1] dispatch: assign backend role [puppet] - 10https://gerrit.wikimedia.org/r/848191 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [12:52:24] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.148 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:52:30] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1057.eqiad.wmnet [12:52:46] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 4.201 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:52:59] !log 
mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2060.codfw.wmnet [12:54:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T321312)', diff saved to https://phabricator.wikimedia.org/P36035 and previous config saved to /var/cache/conftool/dbconfig/20221024-125420-ladsgroup.json [12:58:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T321312)', diff saved to https://phabricator.wikimedia.org/P36036 and previous config saved to /var/cache/conftool/dbconfig/20221024-125836-ladsgroup.json [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221024T1300). [13:00:04] koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:42] o/ [13:01:49] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2060.codfw.wmnet [13:03:40] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2061.codfw.wmnet [13:03:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T321312)', diff saved to https://phabricator.wikimedia.org/P36037 and previous config saved to /var/cache/conftool/dbconfig/20221024-130349-ladsgroup.json [13:03:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance [13:04:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance [13:04:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T321312)', diff saved to https://phabricator.wikimedia.org/P36038 and previous config saved to /var/cache/conftool/dbconfig/20221024-130413-ladsgroup.json [13:04:38] o/ [13:06:12] I don’t think I can deploy today, sorry [13:06:32] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:06:38] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:07:09] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:07:15] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. 
[13:08:26] RECOVERY - Check systemd state on dbstore1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T321312)', diff saved to https://phabricator.wikimedia.org/P36039 and previous config saved to /var/cache/conftool/dbconfig/20221024-131037-ladsgroup.json [13:11:25] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1057.eqiad.wmnet [13:12:19] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:13:07] (03CR) 10FNegri: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/831036 (owner: 10David Caro) [13:13:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P36040 and previous config saved to /var/cache/conftool/dbconfig/20221024-131343-ladsgroup.json [13:16:49] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:16:51] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:18:39] 10SRE-OnFire, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, and 5 others: Only generate maxlag from pooled query service servers. - https://phabricator.wikimedia.org/T238751 (10ItamarWMDE) [13:20:31] 10SRE-OnFire, 10Wikidata, 10Wikidata-Query-Service, 10wdwb-tech, and 5 others: Only generate maxlag from pooled query service servers. 
- https://phabricator.wikimedia.org/T238751 (10Addshore) {meme, src=itshappening} [13:20:57] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:25:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P36041 and previous config saved to /var/cache/conftool/dbconfig/20221024-132544-ladsgroup.json [13:28:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P36042 and previous config saved to /var/cache/conftool/dbconfig/20221024-132849-ladsgroup.json [13:29:16] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/848302 (https://phabricator.wikimedia.org/T163673) (owner: 10Jbond) [13:29:24] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] dispatch: assign backend role [puppet] - 10https://gerrit.wikimedia.org/r/848191 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [13:29:26] (03PS10) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412) [13:31:18] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2061.codfw.wmnet [13:34:24] Hi, is there anyone could deploy today? 
[13:34:59] !log delete PNI to cloudflare - T259036 [13:35:03] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1058.eqiad.wmnet [13:35:33] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2062.codfw.wmnet [13:40:16] (03PS11) 10Hashar: scap: automatize plugins handling [software/gerrit] (deploy/wmf/stable-3.4) - 10https://gerrit.wikimedia.org/r/831093 (https://phabricator.wikimedia.org/T317412) [13:40:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P36043 and previous config saved to /var/cache/conftool/dbconfig/20221024-134050-ladsgroup.json [13:41:16] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:42:01] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [13:42:03] koi: I can try to look at it now [13:42:13] but I don’t think I’ve deployed a dblist change before [13:42:20] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:42:22] so I’ll try to find out if there’s anything I need to watch out for [13:42:39] (the diffConfig looks good, at least) [13:42:49] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2062.codfw.wmnet [13:42:54] ok, thanks! 
[13:43:41] looks like https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/836878 didn’t require anything special [13:43:45] so I guess I’ll try my luck [13:43:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T321312)', diff saved to https://phabricator.wikimedia.org/P36044 and previous config saved to /var/cache/conftool/dbconfig/20221024-134356-ladsgroup.json [13:44:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [13:44:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2173.codfw.wmnet with reason: Maintenance [13:44:16] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.820 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:44:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [13:44:25] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2063.codfw.wmnet [13:44:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [13:44:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2173 (T321312)', diff saved to https://phabricator.wikimedia.org/P36045 and previous config saved to /var/cache/conftool/dbconfig/20221024-134437-ladsgroup.json [13:44:39] (03PS2) 10Lucas Werkmeister (WMDE): plwikimedia: Enable VisualEditor by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845860 (https://phabricator.wikimedia.org/T321308) (owner: 10Stang) [13:45:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845860 (https://phabricator.wikimedia.org/T321308) (owner: 10Stang) [13:45:39] (03PS1) 
10Ayounsi: Decom CF PNI [homer/public] - 10https://gerrit.wikimedia.org/r/848343 (https://phabricator.wikimedia.org/T259036) [13:45:58] (03Merged) 10jenkins-bot: plwikimedia: Enable VisualEditor by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845860 (https://phabricator.wikimedia.org/T321308) (owner: 10Stang) [13:46:14] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:845860|plwikimedia: Enable VisualEditor by default (T321308)]] [13:46:19] T321308: Enable VisualEditor by default on pl.wikimedia.org - https://phabricator.wikimedia.org/T321308 [13:46:33] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and stang: Backport for [[gerrit:845860|plwikimedia: Enable VisualEditor by default (T321308)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [13:46:47] (03PS1) 10Elukey: admin_ng: update Istio settings for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/848344 (https://phabricator.wikimedia.org/T320374) [13:47:01] koi: can you test it? 
[13:47:10] looking [13:47:25] (hm, `scap backport` has to assume IRC names match gerrit names, I didn’t realize that before) [13:47:35] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1058.eqiad.wmnet [13:47:57] (03CR) 10Ayounsi: [C: 03+2] Decom CF PNI [homer/public] - 10https://gerrit.wikimedia.org/r/848343 (https://phabricator.wikimedia.org/T259036) (owner: 10Ayounsi) [13:48:22] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1059.eqiad.wmnet [13:48:36] (03Merged) 10jenkins-bot: Decom CF PNI [homer/public] - 10https://gerrit.wikimedia.org/r/848343 (https://phabricator.wikimedia.org/T259036) (owner: 10Ayounsi) [13:48:57] (03PS2) 10Elukey: admin_ng: update Istio settings for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/848344 (https://phabricator.wikimedia.org/T320374) [13:49:00] Lucas_WMDE: I see an "Edytuj" tab in incognito mode, and the visual editor disappeared in beta feature, so I think LGTM [13:49:23] \o/ [13:49:28] thanks, syncing [13:49:39] (03PS3) 10Elukey: admin_ng: update Istio settings for ml-serve clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/848344 (https://phabricator.wikimedia.org/T320374) [13:50:08] I don’t feel confident to review that logos change, sorry [13:50:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:50:48] that's ok, I'll find someone else [13:50:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10Jclark-ctr) Verified mgmt cables they are connected and have link [13:51:00] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10MatthewVernon) I find myself naively 
wondering if a script that ran on the ho... [13:51:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T321312)', diff saved to https://phabricator.wikimedia.org/P36046 and previous config saved to /var/cache/conftool/dbconfig/20221024-135105-ladsgroup.json [13:53:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:53:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:53:35] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2063.codfw.wmnet [13:53:39] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:845860|plwikimedia: Enable VisualEditor by default (T321308)]] (duration: 07m 25s) [13:53:49] T321308: Enable VisualEditor by default on pl.wikimedia.org - https://phabricator.wikimedia.org/T321308 [13:54:02] !log UTC afternoon backport+config window done [13:54:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:55:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T321312)', diff saved to https://phabricator.wikimedia.org/P36047 and previous config saved to /var/cache/conftool/dbconfig/20221024-135557-ladsgroup.json [13:58:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1133.eqiad.wmnet with reason: Maintenance [13:58:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1133.eqiad.wmnet with reason: Maintenance [13:58:36] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1059.eqiad.wmnet [14:00:38] (03PS1) 10AikoChou: ml-services: move eventgate env variables to outlink predictor [deployment-charts] - 10https://gerrit.wikimedia.org/r/848348 
(https://phabricator.wikimedia.org/T315994) [14:01:09] (03CR) 10Filippo Giunchedi: "I have run another audit from John's puppetdb resources dump and this is what's left:" [puppet] - 10https://gerrit.wikimedia.org/r/841924 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [14:02:20] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:02:23] (03CR) 10AikoChou: "/me hides in shame" [deployment-charts] - 10https://gerrit.wikimedia.org/r/848348 (https://phabricator.wikimedia.org/T315994) (owner: 10AikoChou) [14:03:40] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:03:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [14:03:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [14:04:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T321312)', diff saved to https://phabricator.wikimedia.org/P36048 and previous config saved to /var/cache/conftool/dbconfig/20221024-140404-ladsgroup.json [14:04:55] (03PS2) 10Filippo Giunchedi: systemd: drop timer-specific alert in favor of generic alert [puppet] - 10https://gerrit.wikimedia.org/r/841924 (https://phabricator.wikimedia.org/T303253) [14:04:57] (03PS1) 10Filippo Giunchedi: Use generic 'Check systemd state' alert to catch timer failures [puppet] - 10https://gerrit.wikimedia.org/r/848349 (https://phabricator.wikimedia.org/T303253) [14:06:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P36049 and previous config saved to /var/cache/conftool/dbconfig/20221024-140612-ladsgroup.json [14:09:08] RECOVERY - 
mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:09:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T321312)', diff saved to https://phabricator.wikimedia.org/P36050 and previous config saved to /var/cache/conftool/dbconfig/20221024-140917-ladsgroup.json [14:09:20] (03PS1) 10Jbond: admin: drop bscarone [puppet] - 10https://gerrit.wikimedia.org/r/848350 [14:09:35] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:10:10] (03CR) 10Filippo Giunchedi: "Let me know what you think, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/848349 (https://phabricator.wikimedia.org/T303253) (owner: 10Filippo Giunchedi) [14:10:12] (03CR) 10Elukey: [C: 03+2] "It happens to the best, no need to hide in shame :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/848348 (https://phabricator.wikimedia.org/T315994) (owner: 10AikoChou) [14:10:37] (03CR) 10Jbond: [C: 03+2] admin: drop bscarone [puppet] - 10https://gerrit.wikimedia.org/r/848350 (owner: 10Jbond) [14:11:59] (03CR) 10Hnowlan: helmfile.d: add thumbor configuration (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/824519 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [14:12:40] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 7.832 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:13:26] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.050 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:15:26] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[14:16:05] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [14:18:48] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:21:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P36051 and previous config saved to /var/cache/conftool/dbconfig/20221024-142118-ladsgroup.json [14:24:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P36052 and previous config saved to /var/cache/conftool/dbconfig/20221024-142423-ladsgroup.json [14:26:15] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-52] - https://phabricator.wikimedia.org/T317244 (10ssingh) [14:26:44] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:27:00] (03PS1) 10David Caro: reprepro: add kubeadm-k8s-1-21/22 bullseye suite [puppet] - 10https://gerrit.wikimedia.org/r/848354 (https://phabricator.wikimedia.org/T316541) [14:27:22] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:27:35] (03CR) 10CI reject: [V: 04-1] reprepro: add kubeadm-k8s-1-21/22 bullseye suite [puppet] - 10https://gerrit.wikimedia.org/r/848354 (https://phabricator.wikimedia.org/T316541) (owner: 10David Caro) [14:28:08] (03PS2) 10David Caro: reprepro: add kubeadm-k8s-1-21/22 bullseye suite [puppet] - 10https://gerrit.wikimedia.org/r/848354 (https://phabricator.wikimedia.org/T316541) [14:28:43] (03PS4) 10Elukey: coredns: upgrade to 1.8.7 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844499 
(https://phabricator.wikimedia.org/T321159) [14:29:09] (03CR) 10Elukey: coredns: upgrade to 1.8.7 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/844499 (https://phabricator.wikimedia.org/T321159) (owner: 10Elukey) [14:29:38] (03CR) 10Majavah: [C: 04-1] "If this is for Docker for Harbor, please check first if the Docker versions packaged by Debian are suitable. K8s will be moving from Docke" [puppet] - 10https://gerrit.wikimedia.org/r/848354 (https://phabricator.wikimedia.org/T316541) (owner: 10David Caro) [14:29:54] (03PS3) 10David Caro: reprepro: add kubeadm-k8s-1-21/22 bullseye suite [puppet] - 10https://gerrit.wikimedia.org/r/848354 (https://phabricator.wikimedia.org/T316541) [14:29:56] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:30:21] (03CR) 10Andrew Bogott: [C: 03+2] alerts.downtime_host: attempt to match alert hostnames with : [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 (owner: 10Andrew Bogott) [14:30:30] (03PS15) 10Andrew Bogott: alerts.downtime_host: attempt to match alert hostnames with : [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837132 [14:30:41] (03CR) 10David Caro: reprepro: add kubeadm-k8s-1-21/22 bullseye suite (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/848354 (https://phabricator.wikimedia.org/T316541) (owner: 10David Caro) [14:35:21] (03PS1) 10AikoChou: ml-services: add MODEL_VERSION env to outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/848355 (https://phabricator.wikimedia.org/T315994) [14:36:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T321312)', diff saved to https://phabricator.wikimedia.org/P36053 and previous config saved to /var/cache/conftool/dbconfig/20221024-143625-ladsgroup.json [14:36:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
db2174.codfw.wmnet with reason: Maintenance [14:36:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2174.codfw.wmnet with reason: Maintenance [14:36:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T321312)', diff saved to https://phabricator.wikimedia.org/P36054 and previous config saved to /var/cache/conftool/dbconfig/20221024-143650-ladsgroup.json [14:39:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P36055 and previous config saved to /var/cache/conftool/dbconfig/20221024-143930-ladsgroup.json [14:42:12] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:42:30] !log drain NTT on cr1-eqiad [14:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:56] (03PS1) 10David Caro: p::toolforge:harbor: use distro docker for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/848356 (https://phabricator.wikimedia.org/T316541) [14:43:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T321312)', diff saved to https://phabricator.wikimedia.org/P36056 and previous config saved to /var/cache/conftool/dbconfig/20221024-144311-ladsgroup.json [14:44:39] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) [14:45:08] (03CR) 10CI reject: [V: 04-1] p::toolforge:harbor: use distro docker for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/848356 (https://phabricator.wikimedia.org/T316541) (owner: 10David Caro) [14:46:12] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS2914/IPv6: Idle - NTT, AS2914/IPv4: Idle - NTT 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:46:26] (03PS2) 10David Caro: p::toolforge:harbor: use distro docker for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/848356 (https://phabricator.wikimedia.org/T316541) [14:47:04] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.467 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:47:52] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.503 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:48:43] (03CR) 10Elukey: [C: 03+2] ml-services: add MODEL_VERSION env to outlink [deployment-charts] - 10https://gerrit.wikimedia.org/r/848355 (https://phabricator.wikimedia.org/T315994) (owner: 10AikoChou) [14:54:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T321312)', diff saved to https://phabricator.wikimedia.org/P36057 and previous config saved to /var/cache/conftool/dbconfig/20221024-145436-ladsgroup.json [14:54:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [14:55:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [14:55:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T321312)', diff saved to https://phabricator.wikimedia.org/P36058 and previous config saved to /var/cache/conftool/dbconfig/20221024-145511-ladsgroup.json [14:58:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P36059 and previous config saved to /var/cache/conftool/dbconfig/20221024-145817-ladsgroup.json [15:00:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T321312)', diff saved to 
https://phabricator.wikimedia.org/P36060 and previous config saved to /var/cache/conftool/dbconfig/20221024-150024-ladsgroup.json [15:00:28] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:01:24] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Create dashboard showing aggregate data transfer rates per DC/cluster - https://phabricator.wikimedia.org/T284304 (10BCornwall) a:03BCornwall [15:01:46] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:04:28] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48827 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:05:44] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.271 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:09:24] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1060.eqiad.wmnet [15:10:02] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2064.codfw.wmnet [15:12:49] !log mforns@deploy1002 Started deploy [analytics/refinery@d3b7785]: Regular analytics weekly train [analytics/refinery@d3b7785] [15:12:56] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:13:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P36061 and previous config saved to /var/cache/conftool/dbconfig/20221024-151324-ladsgroup.json [15:14:14] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:15:28] PROBLEM - mailman list info ssl expiry on lists1001 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:15:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P36062 and previous config saved to /var/cache/conftool/dbconfig/20221024-151530-ladsgroup.json [15:16:08] (03CR) 10Herron: [C: 03+2] prometheus: enable prometheus web access via proxy with IDP [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [15:18:23] !log mforns@deploy1002 Finished deploy [analytics/refinery@d3b7785]: Regular analytics weekly train [analytics/refinery@d3b7785] (duration: 05m 34s) [15:18:47] !log mforns@deploy1002 Started deploy [analytics/refinery@d3b7785] (thin): Regular analytics weekly train THIN [analytics/refinery@d3b7785] [15:18:52] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2064.codfw.wmnet [15:18:56] !log mforns@deploy1002 Finished deploy [analytics/refinery@d3b7785] (thin): Regular analytics weekly train THIN [analytics/refinery@d3b7785] (duration: 00m 09s) [15:19:54] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2065.codfw.wmnet [15:20:32] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 221, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:22:41] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [15:23:15] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[15:23:42] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1060.eqiad.wmnet [15:24:40] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:25:04] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) [15:26:09] !log drain eqiad-esams transport [15:26:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:26] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1061.eqiad.wmnet [15:28:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T321312)', diff saved to https://phabricator.wikimedia.org/P36063 and previous config saved to /var/cache/conftool/dbconfig/20221024-152830-ladsgroup.json [15:28:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [15:28:47] 10SRE, 10SRE-swift-storage, 10Data-Engineering-Planning, 10Wikidata, and 3 others: Clean up the rdf-streaming-updater-codfw container from thanos-swift. - https://phabricator.wikimedia.org/T316031 (10dcausse) @bking I see that the doc has been updated, can we move this ticket to the Needs reporting column? 
[15:28:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2176.codfw.wmnet with reason: Maintenance [15:28:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T321312)', diff saved to https://phabricator.wikimedia.org/P36064 and previous config saved to /var/cache/conftool/dbconfig/20221024-152856-ladsgroup.json [15:29:35] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@62b4181]: (no justification provided) [15:29:42] RECOVERY - Check systemd state on elastic1072 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:29:43] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10RKemper) [15:29:46] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@62b4181]: (no justification provided) (duration: 00m 11s) [15:30:09] jan_drewniak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221024T1530). 
[15:30:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P36065 and previous config saved to /var/cache/conftool/dbconfig/20221024-153037-ladsgroup.json [15:32:48] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2065.codfw.wmnet [15:34:07] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2066.codfw.wmnet [15:35:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T321312)', diff saved to https://phabricator.wikimedia.org/P36066 and previous config saved to /var/cache/conftool/dbconfig/20221024-153515-ladsgroup.json [15:36:17] (03CR) 10Herron: [C: 03+1] dispatch: update to latest upstream [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/848228 (https://phabricator.wikimedia.org/T313229) (owner: 10Filippo Giunchedi) [15:38:50] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) [15:41:59] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2066.codfw.wmnet [15:42:00] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1061.eqiad.wmnet [15:42:39] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1062.eqiad.wmnet [15:43:20] PROBLEM - SSH on mw1328.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:43:23] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2067.codfw.wmnet [15:44:02] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 22 Dec 2022 06:15:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:44:38] PROBLEM - SSH on db1119.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:44:46] (03PS1) 10Jbond: C:swift::storage: add variable for data directory [puppet] - 10https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677) [15:44:48] (03PS1) 10Jbond: P:swift::storage: add new resource to format via pci path [puppet] - 10https://gerrit.wikimedia.org/r/848419 (https://phabricator.wikimedia.org/T308677) [15:44:50] (03PS1) 10Jbond: ms-be2050: enable disks by path configuerations [puppet] - 10https://gerrit.wikimedia.org/r/848420 (https://phabricator.wikimedia.org/T308677) [15:45:43] (03CR) 10CI reject: [V: 04-1] C:swift::storage: add variable for data directory [puppet] - 10https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [15:45:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T321312)', diff saved to https://phabricator.wikimedia.org/P36067 and previous config saved to /var/cache/conftool/dbconfig/20221024-154543-ladsgroup.json [15:45:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [15:46:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [15:46:47] (03CR) 10MVernon: [C: 03+1] "Seems reasonable to remove to me, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/848302 (https://phabricator.wikimedia.org/T163673) (owner: 10Jbond) [15:49:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [15:49:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [15:49:46] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) [15:49:53] (03CR) 10Brennen Bearnes: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/848186 (owner: 10David Caro) [15:50:06] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be1062.eqiad.wmnet [15:50:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P36068 and previous config saved to /var/cache/conftool/dbconfig/20221024-155022-ladsgroup.json [15:50:33] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) a:05Papaul→03ayounsi [15:51:02] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be1063.eqiad.wmnet [15:51:19] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-be2067.codfw.wmnet [15:52:07] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host ms-be2068.codfw.wmnet [15:52:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [15:53:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [15:53:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T321312)', diff saved to 
https://phabricator.wikimedia.org/P36069 and previous config saved to /var/cache/conftool/dbconfig/20221024-155313-ladsgroup.json [15:53:40] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.373 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:53:42] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 1.165 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:56:19] (03PS2) 10Jbond: C:swift::storage: add variable for data directory [puppet] - 10https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677) [15:56:21] (03PS2) 10Jbond: P:swift::storage: add new resource to format via pci path [puppet] - 10https://gerrit.wikimedia.org/r/848419 (https://phabricator.wikimedia.org/T308677) [15:56:25] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) a:05ayounsi→03Papaul [15:56:29] (03PS2) 10Jbond: ms-be2050: enable disks by path configuerations [puppet] - 10https://gerrit.wikimedia.org/r/848420 (https://phabricator.wikimedia.org/T308677) [15:57:15] (03CR) 10CI reject: [V: 04-1] C:swift::storage: add variable for data directory [puppet] - 10https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [15:57:24] (03CR) 10Michael Große: [C: 04-1] "While this should be already mergeable in principle, it probably makes sense to wait till at least Ic740c43f345c41cad0d28a68fdbd75f0acea2d" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große) [15:59:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T321312)', diff saved to https://phabricator.wikimedia.org/P36070 and previous config saved to /var/cache/conftool/dbconfig/20221024-155926-ladsgroup.json [16:00:17] (03CR) 
10Jdlrobson: "Thanks a bunch for working on this!!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/847234 (https://phabricator.wikimedia.org/T319223) (owner: 10Stang) [16:00:47] (03CR) 10Jdlrobson: [C: 04-1] logos: Automate icon generation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/847234 (https://phabricator.wikimedia.org/T319223) (owner: 10Stang) [16:02:59] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:03:43] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be2068.codfw.wmnet [16:04:40] (03CR) 10Stang: logos: Automate icon generation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/847234 (https://phabricator.wikimedia.org/T319223) (owner: 10Stang) [16:05:25] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:05:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P36071 and previous config saved to /var/cache/conftool/dbconfig/20221024-160528-ladsgroup.json [16:06:29] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.194 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:06:58] (03CR) 10David Caro: [C: 03+2] gitlab_runner: add toolforge ci images to allowed list [puppet] - 10https://gerrit.wikimedia.org/r/848186 (owner: 10David Caro) [16:07:17] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.499 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:09:35] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:11:33] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
ms-be1063.eqiad.wmnet [16:14:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P36072 and previous config saved to /var/cache/conftool/dbconfig/20221024-161432-ladsgroup.json [16:15:49] (03PS1) 10Ssingh: cp4023: decommission host as part of the ulsfo hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/848423 (https://phabricator.wikimedia.org/T317244) [16:16:59] (03Abandoned) 10Ssingh: cp4049: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/845075 (https://phabricator.wikimedia.org/T317244) (owner: 10Ssingh) [16:17:07] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:17:14] (03CR) 10Ssingh: "do not merge before Tuesday Oct 25" [puppet] - 10https://gerrit.wikimedia.org/r/848423 (https://phabricator.wikimedia.org/T317244) (owner: 10Ssingh) [16:20:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T321312)', diff saved to https://phabricator.wikimedia.org/P36073 and previous config saved to /var/cache/conftool/dbconfig/20221024-162035-ladsgroup.json [16:23:24] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): Replace labstore100[67] with clouddumps100[12] - https://phabricator.wikimedia.org/T309346 (10BTullis) The directory `/srv/deployment/analytics` had incorrect ownership on the new hosts, so our deployment failed. https... 
[16:26:55] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.43 ms [16:29:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P36074 and previous config saved to /var/cache/conftool/dbconfig/20221024-162939-ladsgroup.json [16:30:16] (03PS2) 10Cwhite: logstash: add sanitize filter [puppet] - 10https://gerrit.wikimedia.org/r/844556 (https://phabricator.wikimedia.org/T321241) [16:33:39] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:34:13] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:35:44] (03PS1) 10Jbond: sre.swift.audit-labels: Audit the disk labels of swift backend hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/848427 [16:37:49] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 9.291 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:38:08] (03PS2) 10Jbond: sre.swift.audit-labels: Audit the disk labels of swift backend hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/848427 [16:38:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 4.684 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:41:27] (03CR) 10CI reject: [V: 04-1] sre.swift.audit-labels: Audit the disk labels of swift backend hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/848427 (owner: 10Jbond) [16:44:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T321312)', diff saved to https://phabricator.wikimedia.org/P36075 and previous config saved to /var/cache/conftool/dbconfig/20221024-164446-ladsgroup.json [16:44:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 
0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [16:45:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1184.eqiad.wmnet with reason: Maintenance [16:45:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T321312)', diff saved to https://phabricator.wikimedia.org/P36076 and previous config saved to /var/cache/conftool/dbconfig/20221024-164510-ladsgroup.json [16:45:33] RECOVERY - SSH on db1119.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:47:31] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:52:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T321312)', diff saved to https://phabricator.wikimedia.org/P36077 and previous config saved to /var/cache/conftool/dbconfig/20221024-165229-ladsgroup.json [16:53:06] (03PS3) 10Jbond: sre.swift.audit-labels: Audit the disk labels of swift backend hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/848427 (https://phabricator.wikimedia.org/T308677) [16:53:37] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 1.78 ms [16:56:19] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:56:22] (03CR) 10CI reject: [V: 04-1] sre.swift.audit-labels: Audit the disk labels of swift backend hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/848427 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [16:56:53] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:58:53] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 6.138 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:00:05] ryankemper: May I 
have your attention please! Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221024T1700) [17:00:10] (03CR) 10Dzahn: [C: 03+2] gitlab_runner: add toolforge ci images to allowed list [puppet] - 10https://gerrit.wikimedia.org/r/848186 (owner: 10David Caro) [17:00:15] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48827 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:00:15] (03PS3) 10Dzahn: gitlab_runner: add toolforge ci images to allowed list [puppet] - 10https://gerrit.wikimedia.org/r/848186 (owner: 10David Caro) [17:00:39] (03CR) 10Dzahn: [V: 03+2] gitlab_runner: add toolforge ci images to allowed list [puppet] - 10https://gerrit.wikimedia.org/r/848186 (owner: 10David Caro) [17:03:00] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10jbond) >>! In T308677#8338658, @MatthewVernon wrote: > I find myself naively... 
[17:06:35] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:07:11] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:07:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P36078 and previous config saved to /var/cache/conftool/dbconfig/20221024-170735-ladsgroup.json [17:08:11] PROBLEM - Host parse1001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:16:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 1.624 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:17:19] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.922 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:17:59] (03CR) 10Dzahn: "afaict, from trying to do this before, if we could just replace this one line: "if $::site in keys($wikimedia_clusters['appserver']['sites" [puppet] - 10https://gerrit.wikimedia.org/r/845027 (owner: 10JHathaway) [17:20:31] RECOVERY - Host parse1001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [17:22:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P36079 and previous config saved to /var/cache/conftool/dbconfig/20221024-172242-ladsgroup.json [17:24:06] 10SRE, 10Cloud-Services, 10Datasets-General-or-Unknown, 10affects-Kiwix-and-openZIM, 10cloud-services-team (Kanban): Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503 (10Kelson) @nskaggs Not sure this is a question to me, but in the case it needed, could you please change... 
[17:28:30] (03PS1) 10Dzahn: dumps: switch kiwix download host to master.download.kiwix.org [puppet] - 10https://gerrit.wikimedia.org/r/848441 (https://phabricator.wikimedia.org/T57503) [17:28:41] (03PS1) 10Herron: prometheus: update web_idp urls to prometheus-$site.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/848442 (https://phabricator.wikimedia.org/T301944) [17:32:08] (03CR) 10Dzahn: "also compare https://download.kiwix.org/ and https://master.download.kiwix.org/" [puppet] - 10https://gerrit.wikimedia.org/r/848441 (https://phabricator.wikimedia.org/T57503) (owner: 10Dzahn) [17:32:21] (03PS2) 10Dzahn: dumps: switch kiwix download host to master.download.kiwix.org [puppet] - 10https://gerrit.wikimedia.org/r/848441 (https://phabricator.wikimedia.org/T57503) [17:37:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T321312)', diff saved to https://phabricator.wikimedia.org/P36080 and previous config saved to /var/cache/conftool/dbconfig/20221024-173748-ladsgroup.json [17:37:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [17:38:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [17:38:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T321312)', diff saved to https://phabricator.wikimedia.org/P36081 and previous config saved to /var/cache/conftool/dbconfig/20221024-173812-ladsgroup.json [17:38:42] 10SRE, 10Wikimedia-Mailing-lists: Archive wikifr-l Mailing list - https://phabricator.wikimedia.org/T320312 (10Kelson) I have achieved to create account an retrieve admin access to the list, but really no clue how to remove the few messages (from the public archive). Not even sure this is possible at all. 
[17:41:23] (03CR) 10Jbond: [C: 03+1] wikimedia_clusters: remove id (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845027 (owner: 10JHathaway) [17:44:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T321312)', diff saved to https://phabricator.wikimedia.org/P36082 and previous config saved to /var/cache/conftool/dbconfig/20221024-174431-ladsgroup.json [17:44:45] (03PS1) 10Dzahn: dumps: add sister projects to kiwix dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/848444 (https://phabricator.wikimedia.org/T57503) [17:45:28] (03CR) 10CI reject: [V: 04-1] dumps: add sister projects to kiwix dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/848444 (https://phabricator.wikimedia.org/T57503) (owner: 10Dzahn) [17:46:04] (03CR) 10Herron: [C: 03+2] prometheus: update web_idp urls to prometheus-$site.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/848442 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [17:47:23] (03PS3) 10Jbond: ms-be2050: enable disks by path configuerations [puppet] - 10https://gerrit.wikimedia.org/r/848420 (https://phabricator.wikimedia.org/T308677) [17:48:12] (03CR) 10Jbond: ms-be2050: enable disks by path configuerations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/848420 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [17:48:21] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37697/console" [puppet] - 10https://gerrit.wikimedia.org/r/848420 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [17:49:09] (03CR) 10Dzahn: wikimedia_clusters: remove id (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845027 (owner: 10JHathaway) [17:50:14] (03CR) 10Jbond: C:swift::storage: add variable for data directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [17:50:36] (03CR) 10Dzahn: [C: 04-1] 
"nice catch by "shellcheck". [SC2066] Since you double quoted this, it will not word split, and the loop will only run once." [puppet] - 10https://gerrit.wikimedia.org/r/848444 (https://phabricator.wikimedia.org/T57503) (owner: 10Dzahn) [17:51:38] (03PS2) 10Dzahn: dumps: add sister projects to kiwix dumps rsync [puppet] - 10https://gerrit.wikimedia.org/r/848444 (https://phabricator.wikimedia.org/T57503) [17:51:51] (03CR) 10Volans: openstack: make domain-aware (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/845004 (https://phabricator.wikimedia.org/T321349) (owner: 10Andrew Bogott) [17:54:13] PROBLEM - prometheus.codfw.wikimedia.org requires authentication on prometheus2005 is CRITICAL: CRITICAL - Cannot make SSL connection. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [17:55:21] PROBLEM - prometheus.codfw.wikimedia.org tls expiry on prometheus2005 is CRITICAL: CRITICAL - Cannot make SSL connection. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [17:59:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P36083 and previous config saved to /var/cache/conftool/dbconfig/20221024-175938-ladsgroup.json [18:02:52] PROBLEM - prometheus.esams.wikimedia.org requires authentication on prometheus3001 is CRITICAL: CRITICAL - Cannot make SSL connection. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:03:01] when wikibugs and stashbot quit together but not net split = cloud problems [18:11:27] PROBLEM - prometheus.eqsin.wikimedia.org requires authentication on prometheus5001 is CRITICAL: CRITICAL - Cannot make SSL connection. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:12:01] PROBLEM - prometheus.eqsin.wikimedia.org tls expiry on prometheus5001 is CRITICAL: CRITICAL - Cannot make SSL connection. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:12:07] please ignore the prometheus requires authentication alerts [18:12:29] and tls expiry as well [18:14:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P36084 and previous config saved to /var/cache/conftool/dbconfig/20221024-181444-ladsgroup.json [18:19:39] PROBLEM - prometheus.codfw.wikimedia.org requires authentication on prometheus2006 is CRITICAL: CRITICAL - Cannot make SSL connection. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:20:01] PROBLEM - prometheus.drmrs.wikimedia.org requires authentication on prometheus6001 is CRITICAL: CRITICAL - Cannot make SSL connection. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:20:05] PROBLEM - prometheus.codfw.wikimedia.org tls expiry on prometheus2006 is CRITICAL: CRITICAL - Cannot make SSL connection. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:21:01] PROBLEM - prometheus.eqiad.wikimedia.org tls expiry on prometheus1005 is CRITICAL: CRITICAL - Cannot make SSL connection. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:21:05] PROBLEM - prometheus.eqiad.wikimedia.org requires authentication on prometheus1005 is CRITICAL: CRITICAL - Cannot make SSL connection. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:22:11] PROBLEM - prometheus.eqiad.wikimedia.org tls expiry on prometheus1006 is CRITICAL: CRITICAL - Cannot make SSL connection. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:23:25] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [18:24:25] (03PS1) 10Bartosz Dziewoński: Allow 'nofollow' on external links in Parsoid output [extensions/VisualEditor] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/848390 (https://phabricator.wikimedia.org/T321437) [18:24:49] PROBLEM - prometheus.drmrs.wikimedia.org tls expiry on prometheus6001 is CRITICAL: CRITICAL - Cannot make SSL connection. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [18:25:14] (03PS1) 10Bartosz Dziewoński: Retry without RESTBase when the page/revision seems to be missing [extensions/DiscussionTools] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/848391 (https://phabricator.wikimedia.org/T315688) [18:26:59] (03PS3) 10Jbond: C:swift::storage: add variable for data directory [puppet] - 10https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677) [18:27:01] (03PS3) 10Jbond: P:swift::storage: add new resource to format via pci path [puppet] - 10https://gerrit.wikimedia.org/r/848419 (https://phabricator.wikimedia.org/T308677) [18:27:03] (03PS4) 10Jbond: ms-be2050: enable disks by path configuerations [puppet] - 10https://gerrit.wikimedia.org/r/848420 (https://phabricator.wikimedia.org/T308677) [18:27:05] (03PS1) 10Jbond: C:swift: add swift disks fact [puppet] - 10https://gerrit.wikimedia.org/r/848451 [18:27:27] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [18:27:34] (03CR) 10CI reject: [V: 04-1] C:swift::storage: add variable for data directory [puppet] - 10https://gerrit.wikimedia.org/r/848418 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [18:29:30] (03CR) 10CI reject: [V: 04-1] C:swift: add swift disks fact [puppet] - 10https://gerrit.wikimedia.org/r/848451 (owner: 10Jbond) [18:29:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T321312)', diff saved to https://phabricator.wikimedia.org/P36085 and previous config saved to /var/cache/conftool/dbconfig/20221024-182951-ladsgroup.json [18:29:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [18:30:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [18:30:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T321312)', diff saved to https://phabricator.wikimedia.org/P36086 and previous config saved to /var/cache/conftool/dbconfig/20221024-183015-ladsgroup.json [18:31:51] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [18:34:35] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [18:37:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T321312)', diff saved to https://phabricator.wikimedia.org/P36087 and previous config saved to /var/cache/conftool/dbconfig/20221024-183732-ladsgroup.json [18:42:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1112', diff saved to https://phabricator.wikimedia.org/P36088 and previous config saved to /var/cache/conftool/dbconfig/20221024-184239-ladsgroup.json [18:43:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repool db1112', diff saved to https://phabricator.wikimedia.org/P36089 and previous config saved to /var/cache/conftool/dbconfig/20221024-184359-ladsgroup.json [18:45:04] (03PS2) 10Jbond: C:swift: add swift disks fact [puppet] - 10https://gerrit.wikimedia.org/r/848451 [18:46:05] RECOVERY - SSH on mw1328.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:46:10] (03CR) 10CI reject: [V: 04-1] C:swift: add swift disks fact [puppet] - 10https://gerrit.wikimedia.org/r/848451 (owner: 10Jbond) [18:47:14] (03CR) 10JHathaway: wikimedia_clusters: remove id (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845027 (owner: 10JHathaway) [18:47:20] (03PS2) 10JHathaway: wikimedia_clusters: remove id [puppet] - 10https://gerrit.wikimedia.org/r/845027 [18:47:44] (03PS1) 10Ladsgroup: Avoid using DBLoadBalancerFactoryConfigBuilder mw service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848453 (https://phabricator.wikimedia.org/T298485) [18:48:17] (03CR) 10JHathaway: [C: 03+2] wikimedia_clusters: remove id [puppet] - 10https://gerrit.wikimedia.org/r/845027 (owner: 10JHathaway) [18:49:51] PROBLEM - Check 
systemd state on mx2001 is CRITICAL: CRITICAL - degraded: The following units failed: generate_otrs_aliases.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:27] jouncebot: nowandnext [18:50:27] No deployments scheduled for the next 1 hour(s) and 9 minute(s) [18:50:27] In 1 hour(s) and 9 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221024T2000) [18:50:32] awesome [18:50:37] (03CR) 10Ladsgroup: [C: 03+2] Avoid using DBLoadBalancerFactoryConfigBuilder mw service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848453 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [18:50:41] (03Abandoned) 10David Caro: reprepro: add kubeadm-k8s-1-21/22 bullseye suite [puppet] - 10https://gerrit.wikimedia.org/r/848354 (https://phabricator.wikimedia.org/T316541) (owner: 10David Caro) [18:51:17] (03CR) 10David Caro: "Tested it on toolsbeta, works well, might add a couple more fixes (permissions mostly), but in another patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/848356 (https://phabricator.wikimedia.org/T316541) (owner: 10David Caro) [18:51:23] (03CR) 10David Caro: [V: 03+1] p::toolforge:harbor: use distro docker for bullseye [puppet] - 10https://gerrit.wikimedia.org/r/848356 (https://phabricator.wikimedia.org/T316541) (owner: 10David Caro) [18:51:29] (03Merged) 10jenkins-bot: Avoid using DBLoadBalancerFactoryConfigBuilder mw service [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848453 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [18:52:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [18:52:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [18:52:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T321312)', diff saved to https://phabricator.wikimedia.org/P36090 and previous config saved to /var/cache/conftool/dbconfig/20221024-185230-ladsgroup.json [18:52:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P36091 and previous config saved to /var/cache/conftool/dbconfig/20221024-185238-ladsgroup.json [18:53:19] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848453 (https://phabricator.wikimedia.org/T298485) (owner: 10Ladsgroup) [18:53:31] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:848453|Avoid using DBLoadBalancerFactoryConfigBuilder mw service (T298485)]] [18:53:35] T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485 [18:53:50] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:848453|Avoid using DBLoadBalancerFactoryConfigBuilder mw service (T298485)]] synced to the testservers: 
mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [18:54:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:55:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:55:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:55:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:58:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T321312)', diff saved to https://phabricator.wikimedia.org/P36092 and previous config saved to /var/cache/conftool/dbconfig/20221024-185856-ladsgroup.json [19:00:26] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:848453|Avoid using DBLoadBalancerFactoryConfigBuilder mw service (T298485)]] (duration: 06m 55s) [19:00:32] T298485: MW scripts should reload the database config - https://phabricator.wikimedia.org/T298485 [19:00:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:02:51] !log mforns@deploy1002 Started deploy [analytics/refinery@d3b7785] (thin): Regular analytics weekly train THIN [analytics/refinery@d3b7785] [19:02:58] !log mforns@deploy1002 Finished deploy [analytics/refinery@d3b7785] (thin): Regular analytics weekly train THIN [analytics/refinery@d3b7785] (duration: 00m 07s) [19:03:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:03:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:04:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:07:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P36093 and previous config saved to /var/cache/conftool/dbconfig/20221024-190745-ladsgroup.json 
[19:14:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P36094 and previous config saved to /var/cache/conftool/dbconfig/20221024-191403-ladsgroup.json [19:15:21] 10SRE, 10Znuny, 10serviceops-collab, 10Patch-For-Review: Move VTRS db passwords to a different hiera location - https://phabricator.wikimedia.org/T303272 (10Arnoldokoth) Hey @jbond Seems like the merge broke Puppet on otrs1001.eqiad.wmnet. It fails with the following error: ` Error: Could not retrieve cata... [19:15:52] 10SRE, 10Znuny, 10serviceops-collab, 10Patch-For-Review: Move VTRS db passwords to a different hiera location - https://phabricator.wikimedia.org/T303272 (10Arnoldokoth) 05Resolved→03Open [19:20:17] (03PS1) 10Ladsgroup: Add 'class' to LBFactory callback config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848477 [19:21:10] (03CR) 10Ladsgroup: [C: 03+2] Add 'class' to LBFactory callback config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848477 (owner: 10Ladsgroup) [19:21:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848477 (owner: 10Ladsgroup) [19:21:53] (03Merged) 10jenkins-bot: Add 'class' to LBFactory callback config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848477 (owner: 10Ladsgroup) [19:22:01] (03CR) 10Dzahn: P:mail::mx: move passwords to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845761 (https://phabricator.wikimedia.org/T303272) (owner: 10Jbond) [19:22:08] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:848477|Add 'class' to LBFactory callback config]] [19:22:27] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:848477|Add 'class' to LBFactory callback config]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [19:22:52] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T321312)', diff saved to https://phabricator.wikimedia.org/P36095 and previous config saved to /var/cache/conftool/dbconfig/20221024-192251-ladsgroup.json [19:22:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:23:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [19:24:19] (03CR) 10Dzahn: P:mail::mx: move passwords to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845761 (https://phabricator.wikimedia.org/T303272) (owner: 10Jbond) [19:24:44] 10SRE-swift-storage, 10Infrastructure-Foundations, 10Patch-For-Review: unstable device mapping of SSDs causing installer problems - example reimage with destruction of swift filesystem - https://phabricator.wikimedia.org/T308677 (10jbond) > luckily puppet doesn't relabel them I have noticed for the aux driv... 
[19:24:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:25:23] (03PS4) 10Jbond: P:swift::storage: add new resource to format via pci path [puppet] - 10https://gerrit.wikimedia.org/r/848419 (https://phabricator.wikimedia.org/T308677) [19:25:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:25:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:26:11] (03PS4) 10Jdlrobson: Promote several Wikipedias to desktop improvements group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845060 (https://phabricator.wikimedia.org/T319012) [19:26:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:27:29] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:848477|Add 'class' to LBFactory callback config]] (duration: 05m 20s) [19:27:55] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:28:38] 10SRE, 10Znuny, 10serviceops-collab, 10Patch-For-Review: Move VTRS db passwords to a different hiera location - https://phabricator.wikimedia.org/T303272 (10Dzahn) Seems like "unwrap" is ok in `.epp` templates but not in `.erb` templates. Should the exim .erb template be converted to an .epp template to... 
[19:29:08] (03CR) 10Jdlrobson: [C: 04-1] logos: Automate icon generation (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/847234 (https://phabricator.wikimedia.org/T319223) (owner: 10Stang) [19:29:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P36096 and previous config saved to /var/cache/conftool/dbconfig/20221024-192909-ladsgroup.json [19:30:03] (03PS5) 10Jdlrobson: Unset some bad logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845035 [19:31:47] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:mail::mx: move passwords to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845761 (https://phabricator.wikimedia.org/T303272) (owner: 10Jbond) [19:32:01] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.581 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:32:03] (03PS1) 10Herron: prometheus: web_idp pin to prometheus(12)005 [puppet] - 10https://gerrit.wikimedia.org/r/848480 (https://phabricator.wikimedia.org/T301944) [19:32:17] (03CR) 10Dzahn: P:mail::mx: move passwords to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845761 (https://phabricator.wikimedia.org/T303272) (owner: 10Jbond) [19:33:48] (03CR) 10Dzahn: P:mail::mx: move passwords to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845761 (https://phabricator.wikimedia.org/T303272) (owner: 10Jbond) [19:34:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance [19:34:36] (03CR) 10Herron: [C: 03+2] prometheus: web_idp pin to prometheus(12)005 [puppet] - 10https://gerrit.wikimedia.org/r/848480 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [19:34:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1098.eqiad.wmnet with reason: Maintenance 
[19:34:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36097 and previous config saved to /var/cache/conftool/dbconfig/20221024-193447-ladsgroup.json [19:34:49] (03PS1) 10Jbond: C:vtrs: dont unwrap this password as its not Sensitive [puppet] - 10https://gerrit.wikimedia.org/r/848481 (https://phabricator.wikimedia.org/T303272) [19:34:56] (03PS5) 10Stang: logos: Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/847234 (https://phabricator.wikimedia.org/T319223) [19:36:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P36098 and previous config saved to /var/cache/conftool/dbconfig/20221024-193610-ladsgroup.json [19:36:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/37699/console" [puppet] - 10https://gerrit.wikimedia.org/r/848481 (https://phabricator.wikimedia.org/T303272) (owner: 10Jbond) [19:37:22] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:mail::mx: move passwords to hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/845761 (https://phabricator.wikimedia.org/T303272) (owner: 10Jbond) [19:37:29] mutante: https://gerrit.wikimedia.org/r/848481 [19:37:35] (03CR) 10Stang: logos: Automate icon generation (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/847234 (https://phabricator.wikimedia.org/T319223) (owner: 10Stang) [19:39:10] jbond: but the "hide" at the beginning stays? [19:40:25] I can't say I get why the same thing on mx works, but I trust you :) [19:40:28] mutante: yes that stays its an exim config option [19:40:41] checked that puppet version is the same too [19:40:52] ack [19:40:58] mutante: for the mx serveres the value is rad in as a Sensitive[String] in vrts it is just a String [19:41:22] *read in i.e. 
it uses Sensitive[String] $foo = lookup [19:41:29] (03CR) 10Dzahn: [C: 03+1] C:vtrs: dont unwrap this password as its not Sensitive [puppet] - 10https://gerrit.wikimedia.org/r/848481 (https://phabricator.wikimedia.org/T303272) (owner: 10Jbond) [19:41:30] instead of String $foo = lookup [19:41:56] jbond: aha! gotcha. I had never used the "Sensitive"/unwrap [19:42:02] +1, thanks for the quick reaction [19:42:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36099 and previous config saved to /var/cache/conftool/dbconfig/20221024-194219-ladsgroup.json [19:42:34] yes, it's a pain for exactly these reasons. i tend not to use it, but then sometimes i think it's probably not too bad, then i hit these issues and regret it ;) [19:42:36] arnoldokoth: are you here?:) see above [19:42:45] it should get better in later versions [19:43:00] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:vtrs: dont unwrap this password as its not Sensitive [puppet] - 10https://gerrit.wikimedia.org/r/848481 (https://phabricator.wikimedia.org/T303272) (owner: 10Jbond) [19:43:13] so it's NOT "you should convert your .erb to an .epp" [19:43:19] as I thought earlier [19:43:56] (03PS1) 10Herron: dns: add prometheus-$site.wm.o entries for prometheus web interface [dns] - 10https://gerrit.wikimedia.org/r/848488 (https://phabricator.wikimedia.org/T301944) [19:44:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T321312)', diff saved to https://phabricator.wikimedia.org/P36100 and previous config saved to /var/cache/conftool/dbconfig/20221024-194416-ladsgroup.json [19:44:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [19:44:46] mutante: i don't think there is a need to convert everything to epp. it has some advantages over erb but erb has some over epp, so it's a judgment call i would say.
however most newer code i see elsewhere these days has started to move to epp [19:44:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [19:44:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T321312)', diff saved to https://phabricator.wikimedia.org/P36101 and previous config saved to /var/cache/conftool/dbconfig/20221024-194452-ladsgroup.json [19:45:24] but i haven't seen anything suggesting erb is being deprecated [19:46:34] jbond: ACK, makes sense. Just noticed how the same module is using both of them. for a moment it seemed like this was why. But unrelated [19:47:11] RECOVERY - Check systemd state on mx2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:47:14] runs puppet on otrs1001 [19:47:25] oh, look, mx2001 systemd state, heh [19:47:44] mutante: i just ran puppet on otrs and all looked good [19:47:45] otrs1001 puppet is happy now, thx [19:47:49] confirmed:) [19:47:49] np [19:48:28] 10SRE, 10Znuny, 10serviceops-collab, 10Patch-For-Review: Move VTRS db passwords to a different hiera location - https://phabricator.wikimedia.org/T303272 (10jbond) 05Open→03Resolved >>! In T303272#8339600, @Arnoldokoth wrote: > Hey @jbond Seems like the merge broke Puppet on otrs1001.eqiad.wmnet. It fa...
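For readers following the Sensitive/unwrap exchange above, here is a minimal Ruby sketch of why calling `unwrap` inside an `.erb` template fails when the value was looked up as a plain String rather than as a Sensitive[String]. Everything named here is an assumption made for illustration: `FakeSensitive` is a hypothetical stand-in and not Puppet's real class, and the template line and the `password` variable are invented, not taken from the exim or vrts templates.

```ruby
require 'erb'

# Hypothetical stand-in for a Sensitive-style wrapper (NOT Puppet's real class).
class FakeSensitive
  def initialize(value)
    @value = value
  end

  # Hand back the wrapped value, as a template has to do explicitly.
  def unwrap
    @value
  end

  # Redact the value if the wrapper itself is interpolated by mistake.
  def to_s
    '[redacted]'
  end
end

# Hypothetical template line; "hide" mimics a config option kept as literal text.
TEMPLATE = 'hide password = <%= password.unwrap %>'

# Render the template with the given value bound to `password`.
def render(password)
  ERB.new(TEMPLATE).result_with_hash(password: password)
end

# Same as render, but report the failure class instead of raising.
def safe_render(password)
  render(password)
rescue NoMethodError => e
  "error: #{e.class}"
end

puts render(FakeSensitive.new('s3cret')) # wrapped value: unwrap works
puts safe_render('s3cret')               # plain String: there is no unwrap method
```

Under these assumptions the template itself is fine either way; what matters is the type of the value handed to it, which is consistent with the fix in the log being to drop the `unwrap` call rather than to convert the template to `.epp`.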
[19:51:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T321312)', diff saved to https://phabricator.wikimedia.org/P36102 and previous config saved to /var/cache/conftool/dbconfig/20221024-195128-ladsgroup.json [19:57:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P36103 and previous config saved to /var/cache/conftool/dbconfig/20221024-195725-ladsgroup.json [19:58:19] (03CR) 10Herron: [C: 03+2] dns: add prometheus-$site.wm.o entries for prometheus web interface [dns] - 10https://gerrit.wikimedia.org/r/848488 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [20:00:04] RoanKattouw, Urbanecm, and cjming: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221024T2000) [20:00:04] Jdlrobson and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:07] present [20:00:20] i can deploy today [20:00:37] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:00:59] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845035 (owner: 10Jdlrobson) [20:01:07] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:01:17] MatmaRex: hi, are you around? [20:01:29] 10SRE, 10Wikimedia-Mailing-lists: Archive wikifr-l Mailing list - https://phabricator.wikimedia.org/T320312 (10Ladsgroup) Go to the thread's archive. E.g. 
https://lists.wikimedia.org/hyperkitty/list/wikifr-l@lists.wikimedia.org/thread/WE77SYRXAWLJH3N2L7PZFVVURPOO4MKH/ At the right side: {F35622393} [20:01:41] (03Merged) 10jenkins-bot: Unset some bad logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845035 (owner: 10Jdlrobson) [20:01:56] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:845035|Unset some bad logos]] [20:02:14] (03PS5) 10Urbanecm: Promote several Wikipedias to desktop improvements group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845060 (https://phabricator.wikimedia.org/T319012) (owner: 10Jdlrobson) [20:02:15] !log urbanecm@deploy1002 urbanecm and jdlrobson: Backport for [[gerrit:845035|Unset some bad logos]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:02:18] (03CR) 10Urbanecm: [C: 03+2] Promote several Wikipedias to desktop improvements group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845060 (https://phabricator.wikimedia.org/T319012) (owner: 10Jdlrobson) [20:02:30] Jdlrobson: your first patch is at mwdebug1001, please check [20:02:34] looking [20:03:03] (03Merged) 10jenkins-bot: Promote several Wikipedias to desktop improvements group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845060 (https://phabricator.wikimedia.org/T319012) (owner: 10Jdlrobson) [20:03:56] LGTM [20:03:58] ^ urbanecm [20:04:03] hi, sorry [20:04:06] Jdlrobson: great, syncing [20:04:18] MatmaRex: no worries, +2'ing your backport and i'll ping you when they can be tested [20:04:25] (03CR) 10Urbanecm: [C: 03+2] Allow 'nofollow' on external links in Parsoid output [extensions/VisualEditor] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/848390 (https://phabricator.wikimedia.org/T321437) (owner: 10Bartosz Dziewoński) [20:04:28] (03CR) 10Urbanecm: [C: 03+2] Retry without RESTBase when the page/revision seems to be missing [extensions/DiscussionTools] (wmf/1.40.0-wmf.6) - 
10https://gerrit.wikimedia.org/r/848391 (https://phabricator.wikimedia.org/T315688) (owner: 10Bartosz Dziewoński) [20:05:09] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.845 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:06:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P36104 and previous config saved to /var/cache/conftool/dbconfig/20221024-200634-ladsgroup.json [20:06:41] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48829 bytes in 8.593 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:07:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:08:03] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:845035|Unset some bad logos]] (duration: 06m 07s) [20:08:24] Jdlrobson: and live [20:08:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:08:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:08:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845060 (https://phabricator.wikimedia.org/T319012) (owner: 10Jdlrobson) [20:08:43] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:845060|Promote several Wikipedias to desktop improvements group (T319012)]] [20:08:48] T319012: [M] Deploy Vector 2022 skin to next set of wikis - https://phabricator.wikimedia.org/T319012 [20:09:03] !log urbanecm@deploy1002 urbanecm and jdlrobson: Backport for [[gerrit:845060|Promote several Wikipedias to desktop improvements group (T319012)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [20:09:03] urbanecm: yay! 
[20:09:08] Jdlrobson: please check your second patch at mwdebug1001 [20:09:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:09:29] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudbackup1001-dev.eqiad.wmnet [20:09:40] checking [20:10:39] LGTM urbanecm [20:10:43] syncng [20:12:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P36105 and previous config saved to /var/cache/conftool/dbconfig/20221024-201232-ladsgroup.json [20:13:23] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudbackup1001-dev.eqiad.wmnet [20:13:29] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:14:37] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:845060|Promote several Wikipedias to desktop improvements group (T319012)]] (duration: 05m 53s) [20:14:42] and synced [20:14:42] T319012: [M] Deploy Vector 2022 skin to next set of wikis - https://phabricator.wikimedia.org/T319012 [20:14:52] waiting for CI [20:15:03] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:15:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/848390 (https://phabricator.wikimedia.org/T321437) (owner: 10Bartosz Dziewoński) [20:15:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/848391 (https://phabricator.wikimedia.org/T315688) (owner: 10Bartosz Dziewoński) [20:15:33] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host 
cloudbackup1002-dev.eqiad.wmnet [20:16:01] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2085-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [20:17:27] Hi urbanecm, I have put another two patches on the calendar [20:17:43] (hope it's not too late for now [20:17:46] it's not [20:17:57] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1005.wikimedia.org [20:18:31] but I had some conversation about logos/ with Jdlrobson. Jdlrobson: any objections to going ahead with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/847234? [20:19:09] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [20:19:33] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:20:46] (03Merged) 10jenkins-bot: Allow 'nofollow' on external links in Parsoid output [extensions/VisualEditor] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/848390 (https://phabricator.wikimedia.org/T321437) (owner: 10Bartosz Dziewoński) [20:20:49] (03Merged) 10jenkins-bot: Retry without RESTBase when the page/revision seems to be missing [extensions/DiscussionTools] (wmf/1.40.0-wmf.6) - 10https://gerrit.wikimedia.org/r/848391 (https://phabricator.wikimedia.org/T315688) (owner: 10Bartosz 
Dziewoński) [20:21:05] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:848390|Allow 'nofollow' on external links in Parsoid output (T321437)]], [[gerrit:848391|Retry without RESTBase when the page/revision seems to be missing (T315688)]] [20:21:07] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48828 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:21:11] T321437: Unlabelled external links disappear in visual editor - https://phabricator.wikimedia.org/T321437 [20:21:11] T315688: MWException: Error contacting the Parsoid/RESTBase server (HTTP 404) from DiscussionTools (on open wikis) – permalinks unavailable for some edits - https://phabricator.wikimedia.org/T315688 [20:21:13] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:21:24] !log urbanecm@deploy1002 urbanecm and matmarex: Backport for [[gerrit:848390|Allow 'nofollow' on external links in Parsoid output (T321437)]], [[gerrit:848391|Retry without RESTBase when the page/revision seems to be missing (T315688)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [20:21:38] MatmaRex: it's at mwdebug1001 now, can you check please? [20:21:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P36106 and previous config saved to /var/cache/conftool/dbconfig/20221024-202141-ladsgroup.json [20:21:50] yeah. looking [20:22:37] urbanecm: nope Stang's patch looks good to me [20:22:47] ack, thanks Jdlrobson! 
[20:22:58] we had a chat and we're working towards the same goal now I believe :) [20:23:07] (03CR) 10Jdlrobson: [C: 03+1] logos: Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/847234 (https://phabricator.wikimedia.org/T319223) (owner: 10Stang) [20:23:22] just making sure :) [20:23:23] urbanecm: VisualEditor looks good, DiscussionTools should be fine but it's not testable (it'll only be apparent in error logs) [20:23:30] ack, okay. syncing both. [20:23:39] and i greatly appreciate the work here - so nice to get those logo definitions out.of the main configuration. :) [20:23:53] and yes! [20:24:18] koi: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/847234 has merge conflict :/. can you fix it please? [20:24:28] (03PS2) 10Urbanecm: Add wmgSiteLogoVariants support to Chinese projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/847309 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:24:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:24:33] ok, trying [20:25:07] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudbackup1002-dev.eqiad.wmnet [20:25:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:25:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:25:23] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [20:25:55] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:26:38] !log 
andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudservices1005.wikimedia.org [20:27:24] (03PS6) 10Stang: logos: Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/847234 (https://phabricator.wikimedia.org/T319223) [20:27:29] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:27:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T321312)', diff saved to https://phabricator.wikimedia.org/P36107 and previous config saved to /var/cache/conftool/dbconfig/20221024-202738-ladsgroup.json [20:27:44] (03CR) 10Urbanecm: [C: 03+2] Add wmgSiteLogoVariants support to Chinese projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/847309 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:27:44] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:848390|Allow 'nofollow' on external links in Parsoid output (T321437)]], [[gerrit:848391|Retry without RESTBase when the page/revision seems to be missing (T315688)]] (duration: 06m 38s) [20:27:49] T321437: Unlabelled external links disappear in visual editor - https://phabricator.wikimedia.org/T321437 [20:27:50] T315688: MWException: Error contacting the Parsoid/RESTBase server (HTTP 404) from DiscussionTools (on open wikis) – permalinks unavailable for some edits - https://phabricator.wikimedia.org/T315688 [20:27:57] MatmaRex: both patches are live [20:27:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/847309 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) 
[20:28:24] thanks urbanecm [20:28:27] np [20:28:46] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol1005.wikimedia.org [20:28:51] (03Merged) 10jenkins-bot: Add wmgSiteLogoVariants support to Chinese projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/847309 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang) [20:28:52] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1006.wikimedia.org [20:29:05] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:847309|Add wmgSiteLogoVariants support to Chinese projects (T308620)]] [20:29:10] T308620: HIDPI support for logos among Chinese projects - https://phabricator.wikimedia.org/T308620 [20:29:25] !log urbanecm@deploy1002 urbanecm and stang: Backport for [[gerrit:847309|Add wmgSiteLogoVariants support to Chinese projects (T308620)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [20:29:52] koi: please your second patch (the zh one ^^) at mwdebug1001 [20:29:57] looking [20:30:17] (03PS7) 10Urbanecm: logos: Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/847234 (https://phabricator.wikimedia.org/T319223) (owner: 10Stang) [20:30:27] (03CR) 10Urbanecm: [C: 03+2] logos: Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/847234 (https://phabricator.wikimedia.org/T319223) (owner: 10Stang) [20:30:29] PROBLEM - glance-api http on cloudcontrol1005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 123 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:31:13] (03Merged) 10jenkins-bot: logos: Automate icon generation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/847234 (https://phabricator.wikimedia.org/T319223) (owner: 10Stang) [20:32:03] koi: do we have plans to move the rules in 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/847309/2/wmf-config/InitialiseSettings.php to logos/config.yaml as well? [20:32:04] urbanecm: I tested on all four projects with those logo variants defined in IS.php and it worked as expected, so LGTM [20:32:09] great, syncing [20:32:52] Jdlrobson: I thought it would be too complex to be defined in config.yaml, sadly :( [20:33:22] also those are only used for Chinese projects [20:33:35] why too complex? [20:34:26] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices1005.wikimedia.org [20:34:48] I mean there would be too many keys in that yaml file, which somehow brings complexity [20:34:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P36108 and previous config saved to /var/cache/conftool/dbconfig/20221024-203455-ladsgroup.json [20:35:07] https://www.irccloud.com/pastebin/zxJyWThe/ [20:35:16] it seems we support variants in config? [20:35:57] hmm, you are right, and I haven't thought about this [20:35:59] So wouldn't this just be a case of adding use_wordmark / use_tagline there? [20:36:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:36:07] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:847309|Add wmgSiteLogoVariants support to Chinese projects (T308620)]] (duration: 07m 02s) [20:36:13] T308620: HIDPI support for logos among Chinese projects - https://phabricator.wikimedia.org/T308620 [20:36:22] will do if I have some time this week [20:36:37] selected_tagline , selected_wordmark [20:36:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/847234 (https://phabricator.wikimedia.org/T319223) (owner: 10Stang) [20:36:42] Shall I open a ticket?
[20:36:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T321312)', diff saved to https://phabricator.wikimedia.org/P36110 and previous config saved to /var/cache/conftool/dbconfig/20221024-203647-ladsgroup.json [20:36:49] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:847234|logos: Automate icon generation (T319223)]] [20:36:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [20:36:57] T319223: [XL] Deploy new set of logos for Wikipedias - https://phabricator.wikimedia.org/T319223 [20:37:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:37:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:37:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [20:37:09] !log urbanecm@deploy1002 urbanecm and stang: Backport for [[gerrit:847234|logos: Automate icon generation (T319223)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:37:10] (03PS1) 10BCornwall: readme: Add general notes for testing deps [software/acme-chief] - 10https://gerrit.wikimedia.org/r/848512 [20:37:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T321312)', diff saved to https://phabricator.wikimedia.org/P36111 and previous config saved to /var/cache/conftool/dbconfig/20221024-203713-ladsgroup.json [20:37:37] koi: please test the other patch at mwdebug1001 [20:37:49] looking, it may take a while to test the logo one [20:38:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:38:15] sure [20:38:19] it's the last patch anyway [20:40:33] koi i made https://phabricator.wikimedia.org/T321519 :) [20:40:37] thanks for the help today urbanecm [20:40:41] 
no problem [20:41:01] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic2085-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [20:41:17] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol1006.wikimedia.org [20:41:31] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2085-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [20:41:32] (03PS2) 10Jdlrobson: Document check for broken symbolic links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845620 (https://phabricator.wikimedia.org/T319223) [20:41:34] urbanecm: I tested on nowikimedia(for "wikimedia" group), brwikimedia, zhwiki(for variant), kawiktionary(for "null"), wikimania2012wiki, and stewardwiki, they all looks fine for me [20:41:40] great! [20:41:41] urbanecm: what's the process for documentation only changes? [20:41:44] syncing [20:41:52] Do I nee to schedule a backport for https://gerrit.wikimedia.org/r/845620 [20:42:01] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [20:42:09] Jdlrobson: in operations/mediawiki-config? 
just find someone to +2 it (and pull to deployment host) [20:42:28] (03PS3) 10Jdlrobson: Document check for broken symbolic links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845620 (https://phabricator.wikimedia.org/T319223) [20:42:54] (03CR) 10Urbanecm: [C: 03+2] Document check for broken symbolic links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845620 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [20:42:57] lgtm, so, +2'ed :) [20:43:03] PROBLEM - glance-api http on cloudcontrol1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 123 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:43:36] (03Merged) 10jenkins-bot: Document check for broken symbolic links [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845620 (https://phabricator.wikimedia.org/T319223) (owner: 10Jdlrobson) [20:44:05] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:44:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T321312)', diff saved to https://phabricator.wikimedia.org/P36112 and previous config saved to /var/cache/conftool/dbconfig/20221024-204446-ladsgroup.json [20:45:39] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:847234|logos: Automate icon generation (T319223)]] (duration: 08m 49s) [20:45:44] T319223: [XL] Deploy new set of logos for Wikipedias - https://phabricator.wikimedia.org/T319223 [20:45:48] koi: and, live [20:45:50] anything else, anyone? 
[20:46:16] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic2085-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [20:47:44] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10Papaul) @cmooney hey I was about to set up sub-ports on fpc1 pic0 on both cr1-eqiad and cr2-eqiad and realized that lsw1-[e1-f1] are connected to pic0 of fpc1 o... [20:48:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:48:18] !log UTC late B&C window completed [20:48:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:49:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:49:38] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1007.wikimedia.org [20:49:44] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudservices1004.wikimedia.org [20:49:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:50:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P36113 and previous config saved to /var/cache/conftool/dbconfig/20221024-205002-ladsgroup.json [20:59:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P36114 and previous config saved to /var/cache/conftool/dbconfig/20221024-205953-ladsgroup.json [21:00:04] Reedy, sbassett, Maryum, and 
manfredi: May I have your attention please! Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221024T2100) [21:01:31] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudservices1004.wikimedia.org [21:02:34] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudrabbit1003.wikimedia.org [21:04:13] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudcontrol1007.wikimedia.org [21:05:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P36115 and previous config saved to /var/cache/conftool/dbconfig/20221024-210508-ladsgroup.json [21:05:50] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: eqiad: Move links to new MPC7E linecard - https://phabricator.wikimedia.org/T304712 (10cmooney) @papaul hey. I think it can be done in any order. Probably best to hard down the port first to be safe (which will cause the CR to down the BGP sessi... 
[21:05:59] PROBLEM - glance-api http on cloudcontrol1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 123 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:06:48] !log uploaded python3-gjson_0.2.0 to apt.wikimedia.org bullseye-wikimedia [21:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:38] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1003.wikimedia.org [21:10:53] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudrabbit1002.wikimedia.org [21:15:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P36116 and previous config saved to /var/cache/conftool/dbconfig/20221024-211500-ladsgroup.json [21:15:10] (03CR) 10TheDJ: Use the PDF cropbox for rendering (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/805476 (https://phabricator.wikimedia.org/T167420) (owner: 10TheDJ) [21:16:47] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1002.wikimedia.org [21:17:12] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudrabbit1001.wikimedia.org [21:20:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P36117 and previous config saved to /var/cache/conftool/dbconfig/20221024-212016-ladsgroup.json [21:20:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [21:20:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance [21:20:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T321312)', diff saved to 
https://phabricator.wikimedia.org/P36118 and previous config saved to /var/cache/conftool/dbconfig/20221024-212041-ladsgroup.json [21:24:16] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudrabbit1001.wikimedia.org [21:26:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T321312)', diff saved to https://phabricator.wikimedia.org/P36119 and previous config saved to /var/cache/conftool/dbconfig/20221024-212644-ladsgroup.json [21:29:35] PROBLEM - nova-compute proc minimum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:29:46] PROBLEM - nova-compute proc minimum on cloudvirt1034 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:30:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T321312)', diff saved to https://phabricator.wikimedia.org/P36120 and previous config saved to /var/cache/conftool/dbconfig/20221024-213006-ladsgroup.json [21:30:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [21:30:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [21:30:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T321312)', diff saved to https://phabricator.wikimedia.org/P36121 and previous config saved to /var/cache/conftool/dbconfig/20221024-213032-ladsgroup.json [21:32:11] PROBLEM - nova-compute proc minimum on cloudvirt1017 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:36:53] PROBLEM - nova-compute proc maximum on cloudvirt1034 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:38:56] PROBLEM - nova-compute proc maximum on cloudvirt1031 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:38:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T321312)', diff saved to https://phabricator.wikimedia.org/P36122 and previous config saved to /var/cache/conftool/dbconfig/20221024-213859-ladsgroup.json [21:40:16] PROBLEM - nova-compute proc maximum on cloudvirt1017 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:41:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P36123 and previous config saved to /var/cache/conftool/dbconfig/20221024-214150-ladsgroup.json [21:47:15] PROBLEM - cinder-scheduler process on cloudcontrol1005 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:47:29] PROBLEM - cinder-scheduler process on cloudcontrol1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:49:21] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:49:26] PROBLEM - cinder-scheduler process on cloudcontrol1006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:51:23] RECOVERY - cinder-scheduler process on cloudcontrol1005 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:51:31] RECOVERY - cinder-scheduler process on cloudcontrol1006 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:51:37] RECOVERY - cinder-scheduler process on cloudcontrol1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python.* /usr/bin/cinder-scheduler https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:52:17] RECOVERY - nova-compute proc minimum on cloudvirt1031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:52:29] RECOVERY - nova-compute proc minimum on cloudvirt1034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:52:43] RECOVERY - nova-compute proc maximum on cloudvirt1017 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:52:49] RECOVERY - nova-compute proc minimum on cloudvirt1017 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:53:29] RECOVERY - nova-compute proc maximum on cloudvirt1031 is OK: PROCS 
OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:53:31] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:53:31] RECOVERY - nova-compute proc maximum on cloudvirt1034 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:54:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P36124 and previous config saved to /var/cache/conftool/dbconfig/20221024-215405-ladsgroup.json [21:56:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P36125 and previous config saved to /var/cache/conftool/dbconfig/20221024-215657-ladsgroup.json [22:04:35] RECOVERY - glance-api http on cloudcontrol1007 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 1529 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:07:13] RECOVERY - glance-api http on cloudcontrol1006 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 1529 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:08:33] RECOVERY - glance-api http on cloudcontrol1005 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 1516 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:09:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P36126 and previous config saved to 
/var/cache/conftool/dbconfig/20221024-220912-ladsgroup.json [22:12:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T321312)', diff saved to https://phabricator.wikimedia.org/P36127 and previous config saved to /var/cache/conftool/dbconfig/20221024-221203-ladsgroup.json [22:12:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [22:12:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance [22:12:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1136 (T321312)', diff saved to https://phabricator.wikimedia.org/P36128 and previous config saved to /var/cache/conftool/dbconfig/20221024-221227-ladsgroup.json [22:12:45] PROBLEM - SSH on mw1332.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:16:57] PROBLEM - SSH on mw1334.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:18:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T321312)', diff saved to https://phabricator.wikimedia.org/P36129 and previous config saved to /var/cache/conftool/dbconfig/20221024-221845-ladsgroup.json [22:20:58] (03PS1) 10Dzahn: miscweb: add rsyslog::input::files to send apache logs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/848547 (https://phabricator.wikimedia.org/T216090) [22:23:40] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/pcc-worker1003/37700/miscweb2002.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/848547 (https://phabricator.wikimedia.org/T216090) (owner: 10Dzahn) [22:24:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T321312)', diff saved 
to https://phabricator.wikimedia.org/P36130 and previous config saved to /var/cache/conftool/dbconfig/20221024-222418-ladsgroup.json [22:24:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [22:24:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [22:24:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T321312)', diff saved to https://phabricator.wikimedia.org/P36131 and previous config saved to /var/cache/conftool/dbconfig/20221024-222444-ladsgroup.json [22:27:56] (03CR) 10Dzahn: [C: 03+2] "apache logs in syslog:" [puppet] - 10https://gerrit.wikimedia.org/r/848547 (https://phabricator.wikimedia.org/T216090) (owner: 10Dzahn) [22:31:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T321312)', diff saved to https://phabricator.wikimedia.org/P36133 and previous config saved to /var/cache/conftool/dbconfig/20221024-223109-ladsgroup.json [22:33:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P36134 and previous config saved to /var/cache/conftool/dbconfig/20221024-223352-ladsgroup.json [22:45:15] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:46:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P36135 and previous config saved to /var/cache/conftool/dbconfig/20221024-224616-ladsgroup.json [22:48:02] (03CR) 10Dzahn: [C: 03+2] "@cwhite Thank you. Now I see the apache logs in syslog on the miscweb* VMs. 
Should this be all or is there another step until I could expe" [puppet] - 10https://gerrit.wikimedia.org/r/848547 (https://phabricator.wikimedia.org/T216090) (owner: 10Dzahn) [22:48:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136', diff saved to https://phabricator.wikimedia.org/P36136 and previous config saved to /var/cache/conftool/dbconfig/20221024-224858-ladsgroup.json [23:00:25] !log on mwmaint1002 running renameInvalidUsernames.php for T292552 [23:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:30] T292552: Rename articles and users to update our case mapping to PHP 7.4 and Unicode 11 - https://phabricator.wikimedia.org/T292552 [23:01:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P36137 and previous config saved to /var/cache/conftool/dbconfig/20221024-230122-ladsgroup.json [23:04:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1136 (T321312)', diff saved to https://phabricator.wikimedia.org/P36138 and previous config saved to /var/cache/conftool/dbconfig/20221024-230405-ladsgroup.json [23:04:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [23:04:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [23:04:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [23:04:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [23:04:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T321312)', diff saved to 
https://phabricator.wikimedia.org/P36139 and previous config saved to /var/cache/conftool/dbconfig/20221024-230446-ladsgroup.json [23:10:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T321312)', diff saved to https://phabricator.wikimedia.org/P36140 and previous config saved to /var/cache/conftool/dbconfig/20221024-231058-ladsgroup.json [23:13:41] RECOVERY - SSH on mw1332.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:16:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T321312)', diff saved to https://phabricator.wikimedia.org/P36141 and previous config saved to /var/cache/conftool/dbconfig/20221024-231629-ladsgroup.json [23:16:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [23:16:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [23:16:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [23:17:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [23:17:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T321312)', diff saved to https://phabricator.wikimedia.org/P36142 and previous config saved to /var/cache/conftool/dbconfig/20221024-231721-ladsgroup.json [23:23:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T321312)', diff saved to https://phabricator.wikimedia.org/P36143 and previous config saved to /var/cache/conftool/dbconfig/20221024-232343-ladsgroup.json [23:26:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to 
https://phabricator.wikimedia.org/P36144 and previous config saved to /var/cache/conftool/dbconfig/20221024-232604-ladsgroup.json [23:28:07] (03PS1) 10Stang: Move wmgSiteLogoVariants to logos.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848552 (https://phabricator.wikimedia.org/T308620) [23:38:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P36145 and previous config saved to /var/cache/conftool/dbconfig/20221024-233849-ladsgroup.json [23:41:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P36146 and previous config saved to /var/cache/conftool/dbconfig/20221024-234111-ladsgroup.json [23:53:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P36147 and previous config saved to /var/cache/conftool/dbconfig/20221024-235357-ladsgroup.json [23:56:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T321312)', diff saved to https://phabricator.wikimedia.org/P36148 and previous config saved to /var/cache/conftool/dbconfig/20221024-235618-ladsgroup.json [23:56:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [23:56:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [23:56:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T321312)', diff saved to https://phabricator.wikimedia.org/P36149 and previous config saved to /var/cache/conftool/dbconfig/20221024-235645-ladsgroup.json [23:58:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T321312)', diff saved to https://phabricator.wikimedia.org/P36150 and previous config saved to 
/var/cache/conftool/dbconfig/20221024-235804-ladsgroup.json