[00:00:09] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:00:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172', diff saved to https://phabricator.wikimedia.org/P41505 and previous config saved to /var/cache/conftool/dbconfig/20221129-000042-marostegui.json [00:01:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T323907)', diff saved to https://phabricator.wikimedia.org/P41506 and previous config saved to /var/cache/conftool/dbconfig/20221129-000143-ladsgroup.json [00:01:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance [00:01:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2109.codfw.wmnet with reason: Maintenance [00:01:50] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [00:01:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T323907)', diff saved to https://phabricator.wikimedia.org/P41507 and previous config saved to /var/cache/conftool/dbconfig/20221129-000153-ladsgroup.json [00:03:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T322618)', diff saved to https://phabricator.wikimedia.org/P41508 and previous config saved to /var/cache/conftool/dbconfig/20221129-000341-ladsgroup.json [00:03:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [00:03:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [00:03:48] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [00:04:46] (03PS1) 10Papaul: Fix typo on hostname [puppet] - 
10https://gerrit.wikimedia.org/r/861499 (https://phabricator.wikimedia.org/T319433) [00:05:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [00:05:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [00:05:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T322618)', diff saved to https://phabricator.wikimedia.org/P41509 and previous config saved to /var/cache/conftool/dbconfig/20221129-000545-ladsgroup.json [00:05:47] PROBLEM - MediaWiki EtcdConfig up-to-date on mw1344 is CRITICAL: etcd last index (1330553) is outdated compared to the master one (1330556) https://wikitech.wikimedia.org/wiki/Etcd [00:05:55] PROBLEM - MediaWiki EtcdConfig up-to-date on mw2334 is CRITICAL: etcd last index (1890921) is outdated compared to the master one (1890927) https://wikitech.wikimedia.org/wiki/Etcd [00:06:05] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:06:27] (03CR) 10Papaul: [C: 03+2] Fix typo on hostname [puppet] - 10https://gerrit.wikimedia.org/r/861499 (https://phabricator.wikimedia.org/T319433) (owner: 10Papaul) [00:07:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host arclamp1001.eqiad.wmnet with OS bullseye [00:07:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host arclamp1001.eqiad.wmnet with OS bullseye [00:07:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P41510 and previous config saved to 
/var/cache/conftool/dbconfig/20221129-000729-ladsgroup.json [00:07:47] RECOVERY - MediaWiki EtcdConfig up-to-date on mw1344 is OK: etcd last index (1330559) matches the master one (1330559) https://wikitech.wikimedia.org/wiki/Etcd [00:07:55] RECOVERY - MediaWiki EtcdConfig up-to-date on mw2334 is OK: etcd last index (1890933) matches the master one (1890933) https://wikitech.wikimedia.org/wiki/Etcd [00:07:59] phab should be back. yell at us if anything gets weird. [00:12:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on arclamp1001.eqiad.wmnet with reason: host reimage [00:12:41] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10wiki_willy) a:03Jclark-ctr [00:15:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2172 (T321126)', diff saved to https://phabricator.wikimedia.org/P41511 and previous config saved to /var/cache/conftool/dbconfig/20221129-001548-marostegui.json [00:15:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2179.codfw.wmnet with reason: Maintenance [00:15:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2179.codfw.wmnet with reason: Maintenance [00:15:56] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [00:16:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2179 (T321126)', diff saved to https://phabricator.wikimedia.org/P41512 and previous config saved to /var/cache/conftool/dbconfig/20221129-001559-marostegui.json [00:16:09] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on arclamp1001.eqiad.wmnet with reason: host reimage [00:18:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T321126)', diff saved to https://phabricator.wikimedia.org/P41513 and previous config saved to 
/var/cache/conftool/dbconfig/20221129-001812-marostegui.json [00:18:51] (03CR) 10RLazarus: [C: 03+2] httpbb: Replace URL for metawiki test [puppet] - 10https://gerrit.wikimedia.org/r/861497 (https://phabricator.wikimedia.org/T323707) (owner: 10RLazarus) [00:19:41] (03PS3) 10Ssingh: [In case of emergency] depool eqsin for hardware refresh [dns] - 10https://gerrit.wikimedia.org/r/856664 [00:20:51] (03CR) 10Ssingh: "Rebased emergency patch. Please DO NOT merge unless there are issues with eqsin." [dns] - 10https://gerrit.wikimedia.org/r/856664 (owner: 10Ssingh) [00:22:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P41514 and previous config saved to /var/cache/conftool/dbconfig/20221129-002236-ladsgroup.json [00:27:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T323907)', diff saved to https://phabricator.wikimedia.org/P41515 and previous config saved to /var/cache/conftool/dbconfig/20221129-002707-ladsgroup.json [00:27:14] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [00:27:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T322618)', diff saved to https://phabricator.wikimedia.org/P41516 and previous config saved to /var/cache/conftool/dbconfig/20221129-002742-ladsgroup.json [00:27:49] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [00:29:23] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host arclamp1001.eqiad.wmnet with OS bullseye [00:29:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host arclamp1001.eqiad.wmnet with OS bullseye completed: -... 
[00:29:58] (KubernetesAPILatency) firing: (6) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:33:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P41517 and previous config saved to /var/cache/conftool/dbconfig/20221129-003319-marostegui.json [00:33:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Papaul) [00:34:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Papaul) 05Open→03Resolved This is done [00:37:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T323907)', diff saved to https://phabricator.wikimedia.org/P41518 and previous config saved to /var/cache/conftool/dbconfig/20221129-003742-ladsgroup.json [00:37:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1101.eqiad.wmnet with reason: Maintenance [00:37:50] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [00:37:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1101.eqiad.wmnet with reason: Maintenance [00:38:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T323907)', diff saved to https://phabricator.wikimedia.org/P41519 and previous config saved to /var/cache/conftool/dbconfig/20221129-003804-ladsgroup.json [00:39:59] (KubernetesAPILatency) firing: (6) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:42:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to 
https://phabricator.wikimedia.org/P41520 and previous config saved to /var/cache/conftool/dbconfig/20221129-004214-ladsgroup.json [00:42:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P41521 and previous config saved to /var/cache/conftool/dbconfig/20221129-004249-ladsgroup.json [00:44:06] (03CR) 10Catrope: [C: 03+1] Upgraded deployment-prep echostore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860941 (owner: 10Eevans) [00:46:36] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:48:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179', diff saved to https://phabricator.wikimedia.org/P41522 and previous config saved to /var/cache/conftool/dbconfig/20221129-004825-marostegui.json [00:53:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (10Papaul) @Jclark-ctr can you please confirm that those servers are connected to a 10G interface. @Marostegui @jcrespo I am trying to setup those servers and i don't kn... [00:55:50] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 117 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:57:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P41524 and previous config saved to /var/cache/conftool/dbconfig/20221129-005720-ladsgroup.json [00:57:42] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:57:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P41525 and previous config saved to /var/cache/conftool/dbconfig/20221129-005755-ladsgroup.json [01:03:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2179 (T321126)', diff saved to https://phabricator.wikimedia.org/P41526 and previous config saved to /var/cache/conftool/dbconfig/20221129-010332-marostegui.json [01:03:40] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [01:12:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T323907)', diff saved to https://phabricator.wikimedia.org/P41527 and previous config saved to /var/cache/conftool/dbconfig/20221129-011227-ladsgroup.json [01:12:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [01:12:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [01:12:34] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [01:13:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T322618)', diff saved to https://phabricator.wikimedia.org/P41528 and previous config saved to /var/cache/conftool/dbconfig/20221129-011302-ladsgroup.json [01:13:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [01:13:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [01:13:09] T322618: Fix renamed 
indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [01:13:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T322618)', diff saved to https://phabricator.wikimedia.org/P41529 and previous config saved to /var/cache/conftool/dbconfig/20221129-011312-ladsgroup.json [01:17:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T322618)', diff saved to https://phabricator.wikimedia.org/P41530 and previous config saved to /var/cache/conftool/dbconfig/20221129-011707-ladsgroup.json [01:23:20] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (Papaul) [01:25:31] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (Papaul) @Andrew can you please specify the partman recipe to use for those servers in the task description?
Thank you [01:26:00] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1054'] [01:26:27] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1054'] [01:26:35] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1054'] [01:27:08] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1054'] [01:32:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P41531 and previous config saved to /var/cache/conftool/dbconfig/20221129-013213-ladsgroup.json [01:37:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:39:43] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q1:rack/setup/install cloudvirt10[54-61].eqiad.wmnet - https://phabricator.wikimedia.org/T313983 (Papaul) @Jclark-ctr can you please double-check and confirm that all those servers are not R640 as it says in Netbox, but R440?
Thanks [01:41:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T323907)', diff saved to https://phabricator.wikimedia.org/P41532 and previous config saved to /var/cache/conftool/dbconfig/20221129-014116-ladsgroup.json [01:41:24] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P41533 and previous config saved to /var/cache/conftool/dbconfig/20221129-014720-ladsgroup.json [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:56:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P41534 and previous config saved to /var/cache/conftool/dbconfig/20221129-015623-ladsgroup.json [02:01:24] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:01:44] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:02:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T322618)', diff saved to https://phabricator.wikimedia.org/P41535 and previous config saved to /var/cache/conftool/dbconfig/20221129-020226-ladsgroup.json [02:02:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [02:02:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [02:02:34] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [02:02:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T322618)', diff saved to https://phabricator.wikimedia.org/P41536 and previous config saved to /var/cache/conftool/dbconfig/20221129-020237-ladsgroup.json [02:03:20] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:06:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T322618)', diff saved to https://phabricator.wikimedia.org/P41537 and previous config saved to /var/cache/conftool/dbconfig/20221129-020631-ladsgroup.json [02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:11:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P41538 and previous config saved to /var/cache/conftool/dbconfig/20221129-021129-ladsgroup.json [02:17:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in 
ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:21:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P41539 and previous config saved to /var/cache/conftool/dbconfig/20221129-022138-ladsgroup.json [02:26:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T323907)', diff saved to https://phabricator.wikimedia.org/P41540 and previous config saved to /var/cache/conftool/dbconfig/20221129-022636-ladsgroup.json [02:26:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance [02:26:43] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [02:26:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1127.eqiad.wmnet with reason: Maintenance [02:26:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T323907)', diff saved to https://phabricator.wikimedia.org/P41541 and previous config saved to /var/cache/conftool/dbconfig/20221129-022657-ladsgroup.json [02:28:44] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [02:32:10] (03CR) 10Andrew Bogott: [C: 03+2] Add cookbook to restart openstack services [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837751 (owner: 10Andrew Bogott) [02:35:19] (03Merged) 10jenkins-bot: Add cookbook to restart openstack services [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/837751 (owner: 10Andrew Bogott) [02:36:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P41542 and previous config 
saved to /var/cache/conftool/dbconfig/20221129-023644-ladsgroup.json [02:44:59] (KubernetesAPILatency) firing: (6) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:48:34] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [02:49:58] (KubernetesAPILatency) firing: (6) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:50:32] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [02:51:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T322618)', diff saved to https://phabricator.wikimedia.org/P41543 and previous config saved to /var/cache/conftool/dbconfig/20221129-025151-ladsgroup.json [02:51:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [02:51:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [02:51:58] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [02:52:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T322618)', diff saved to https://phabricator.wikimedia.org/P41544 and previous config saved to /var/cache/conftool/dbconfig/20221129-025201-ladsgroup.json [02:55:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T322618)', diff saved to https://phabricator.wikimedia.org/P41545 and previous config saved to /var/cache/conftool/dbconfig/20221129-025556-ladsgroup.json [02:56:18] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 331 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [02:58:16] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221129T0300) [03:00:22] PROBLEM - SSH on mw1330.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:05:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T323907)', diff saved to https://phabricator.wikimedia.org/P41546 and previous config saved to /var/cache/conftool/dbconfig/20221129-030557-ladsgroup.json [03:06:05] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [03:06:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [03:07:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [03:07:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [03:07:37] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.40.0-wmf.12 [core] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/860600 (https://phabricator.wikimedia.org/T320517) [03:07:41] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.40.0-wmf.12 [core] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/860600 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) [03:08:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [03:11:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P41547 and previous config saved to /var/cache/conftool/dbconfig/20221129-031103-ladsgroup.json [03:14:26] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on 
alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [03:16:24] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [03:21:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P41548 and previous config saved to /var/cache/conftool/dbconfig/20221129-032103-ladsgroup.json [03:21:20] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/1.40.0-wmf.12 [core] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/860600 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) [03:26:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P41549 and previous config saved to /var/cache/conftool/dbconfig/20221129-032609-ladsgroup.json [03:36:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P41550 and previous config saved to /var/cache/conftool/dbconfig/20221129-033610-ladsgroup.json [03:38:02] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 200 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:40:00] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:41:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T322618)', diff saved to https://phabricator.wikimedia.org/P41551 and previous config saved to /var/cache/conftool/dbconfig/20221129-034116-ladsgroup.json [03:41:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance [03:41:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1189.eqiad.wmnet with reason: Maintenance [03:41:23] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [03:41:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T322618)', diff saved to https://phabricator.wikimedia.org/P41552 and previous config saved to /var/cache/conftool/dbconfig/20221129-034126-ladsgroup.json [03:45:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T322618)', diff saved to https://phabricator.wikimedia.org/P41553 and previous config saved to /var/cache/conftool/dbconfig/20221129-034521-ladsgroup.json [03:51:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T323907)', diff saved to https://phabricator.wikimedia.org/P41554 and previous config saved to /var/cache/conftool/dbconfig/20221129-035116-ladsgroup.json [03:51:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [03:51:25] T323907: Make fr_user 
unsigned - https://phabricator.wikimedia.org/T323907 [03:51:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1158.eqiad.wmnet with reason: Maintenance [03:51:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [03:51:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [03:51:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T323907)', diff saved to https://phabricator.wikimedia.org/P41555 and previous config saved to /var/cache/conftool/dbconfig/20221129-035144-ladsgroup.json [04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221129T0400) [04:00:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P41556 and previous config saved to /var/cache/conftool/dbconfig/20221129-040027-ladsgroup.json [04:04:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [04:04:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [04:04:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [04:05:22] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:05:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [04:13:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2149.codfw.wmnet with
reason: Maintenance [04:13:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [04:13:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T323907)', diff saved to https://phabricator.wikimedia.org/P41557 and previous config saved to /var/cache/conftool/dbconfig/20221129-041332-ladsgroup.json [04:13:40] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [04:15:20] (PS6) Gergő Tisza: GrowthExperiments: End imagerecommendation experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/859991 (https://phabricator.wikimedia.org/T323686) (owner: Kosta Harlan) [04:15:22] (PS5) Gergő Tisza: GrowthExperiments: Start oldimpact experiment [mediawiki-config] - https://gerrit.wikimedia.org/r/860867 (https://phabricator.wikimedia.org/T323526) (owner: Kosta Harlan) [04:15:24] (PS1) Gergő Tisza: [no-op] GrowthExperiments: Enable D3 in production [mediawiki-config] - https://gerrit.wikimedia.org/r/861506 (https://phabricator.wikimedia.org/T318854) [04:15:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P41558 and previous config saved to /var/cache/conftool/dbconfig/20221129-041534-ladsgroup.json [04:19:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T323907)', diff saved to https://phabricator.wikimedia.org/P41559 and previous config saved to /var/cache/conftool/dbconfig/20221129-041912-ladsgroup.json [04:19:19] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [04:25:58] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:28:00] 
RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:30:36] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:30:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T322618)', diff saved to https://phabricator.wikimedia.org/P41560 and previous config saved to /var/cache/conftool/dbconfig/20221129-043040-ladsgroup.json [04:30:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance [04:30:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1198.eqiad.wmnet with reason: Maintenance [04:30:48] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [04:30:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T322618)', diff saved to https://phabricator.wikimedia.org/P41561 and previous config saved to /var/cache/conftool/dbconfig/20221129-043050-ladsgroup.json [04:34:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P41562 and previous config saved to /var/cache/conftool/dbconfig/20221129-043418-ladsgroup.json [04:34:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T322618)', diff saved to https://phabricator.wikimedia.org/P41563 and previous config saved to /var/cache/conftool/dbconfig/20221129-043445-ladsgroup.json [04:39:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T323907)', diff saved to 
https://phabricator.wikimedia.org/P41564 and previous config saved to /var/cache/conftool/dbconfig/20221129-043953-ladsgroup.json [04:40:00] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [04:44:48] (PS1) Ladsgroup: mediawiki: Add quarterly cleanup of flaggedtemplates table [puppet] - https://gerrit.wikimedia.org/r/861507 (https://phabricator.wikimedia.org/T290769) [04:46:59] (CR) CI reject: [V: -1] mediawiki: Add quarterly cleanup of flaggedtemplates table [puppet] - https://gerrit.wikimedia.org/r/861507 (https://phabricator.wikimedia.org/T290769) (owner: Ladsgroup) [04:49:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P41565 and previous config saved to /var/cache/conftool/dbconfig/20221129-044924-ladsgroup.json [04:49:43] (PS2) Ladsgroup: mediawiki: Add quarterly cleanup of flaggedtemplates table [puppet] - https://gerrit.wikimedia.org/r/861507 (https://phabricator.wikimedia.org/T290769) [04:49:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P41566 and previous config saved to /var/cache/conftool/dbconfig/20221129-044952-ladsgroup.json [04:50:04] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 168 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:52:04] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:55:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P41567 and previous config saved to /var/cache/conftool/dbconfig/20221129-045459-ladsgroup.json [05:04:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T323907)', diff saved to https://phabricator.wikimedia.org/P41568 and previous config saved to /var/cache/conftool/dbconfig/20221129-050431-ladsgroup.json [05:04:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [05:04:39] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [05:04:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [05:04:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T323907)', diff saved to https://phabricator.wikimedia.org/P41569 and previous config saved to /var/cache/conftool/dbconfig/20221129-050453-ladsgroup.json [05:04:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P41570 and previous config saved to /var/cache/conftool/dbconfig/20221129-050458-ladsgroup.json [05:10:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P41571 and previous config saved to /var/cache/conftool/dbconfig/20221129-051006-ladsgroup.json [05:20:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T322618)', diff saved to https://phabricator.wikimedia.org/P41572 and previous config saved to 
/var/cache/conftool/dbconfig/20221129-052004-ladsgroup.json [05:20:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [05:20:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [05:20:13] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [05:25:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T323907)', diff saved to https://phabricator.wikimedia.org/P41573 and previous config saved to /var/cache/conftool/dbconfig/20221129-052512-ladsgroup.json [05:25:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [05:25:20] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [05:25:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [05:25:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [05:25:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance [05:25:38] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 200 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:25:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T323907)', diff saved to https://phabricator.wikimedia.org/P41574 and previous config saved to /var/cache/conftool/dbconfig/20221129-052538-ladsgroup.json [05:27:40] RECOVERY - MediaWiki exceptions and fatals per minute for 
api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:40:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T323907)', diff saved to https://phabricator.wikimedia.org/P41575 and previous config saved to /var/cache/conftool/dbconfig/20221129-054003-ladsgroup.json [05:40:11] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [05:40:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T323907)', diff saved to https://phabricator.wikimedia.org/P41576 and previous config saved to /var/cache/conftool/dbconfig/20221129-054029-ladsgroup.json [05:45:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 23 hosts with reason: Primary switchover s3 T323546 [05:45:56] T323546: Switchover s3 master (db1123 -> db1157) - https://phabricator.wikimedia.org/T323546 [05:46:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 23 hosts with reason: Primary switchover s3 T323546 [05:47:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1157 with weight 0 T323546', diff saved to https://phabricator.wikimedia.org/P41577 and previous config saved to /var/cache/conftool/dbconfig/20221129-054717-ladsgroup.json [05:52:28] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2014 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [05:55:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P41578 and previous config saved to /var/cache/conftool/dbconfig/20221129-055510-ladsgroup.json [05:55:37] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P41579 and previous config saved to /var/cache/conftool/dbconfig/20221129-055536-ladsgroup.json [06:10:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P41580 and previous config saved to /var/cache/conftool/dbconfig/20221129-061016-ladsgroup.json [06:10:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P41581 and previous config saved to /var/cache/conftool/dbconfig/20221129-061043-ladsgroup.json [06:18:06] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (Marostegui) No, no need for IPv6 [06:22:24] (PS2) Ladsgroup: mariadb: Promote db1157 to s3 master [puppet] - https://gerrit.wikimedia.org/r/858380 (https://phabricator.wikimedia.org/T323546) (owner: Gerrit maintenance bot) [06:22:28] (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db1157 to s3 master [puppet] - https://gerrit.wikimedia.org/r/858380 (https://phabricator.wikimedia.org/T323546) (owner: Gerrit maintenance bot) [06:23:30] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2014 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:25:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T323907)', diff saved to https://phabricator.wikimedia.org/P41582 and previous config saved to /var/cache/conftool/dbconfig/20221129-062523-ladsgroup.json [06:25:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance [06:25:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
12:00:00 on db2177.codfw.wmnet with reason: Maintenance [06:25:30] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [06:25:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T323907)', diff saved to https://phabricator.wikimedia.org/P41583 and previous config saved to /var/cache/conftool/dbconfig/20221129-062533-ladsgroup.json [06:25:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T323907)', diff saved to https://phabricator.wikimedia.org/P41584 and previous config saved to /var/cache/conftool/dbconfig/20221129-062549-ladsgroup.json [06:25:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [06:26:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1171.eqiad.wmnet with reason: Maintenance [06:35:30] SRE, SRE-Access-Requests, Security-Team: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (Marostegui) Open→Stalled p:Triage→Medium Missing a few fields so far. [06:35:40] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:35:52] SRE, LDAP-Access-Requests, Security-Team: Add Kelton Hurd to wmf ldap group - https://phabricator.wikimedia.org/T323941 (Marostegui) Open→Stalled p:Triage→Medium Missing a few fields so far. [06:37:40] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:42:28] (PS1) Marostegui: Revert "db2174: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/861467 [06:43:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2140.codfw.wmnet with reason: Maintenance [06:43:57] (CR) Marostegui: [C: +2] Revert "db2174: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/861467 (owner: Marostegui) [06:44:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2140.codfw.wmnet with reason: Maintenance [06:44:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2174 (re)pooling @ 10%: After HW maintenance', diff saved to https://phabricator.wikimedia.org/P41585 and previous config saved to /var/cache/conftool/dbconfig/20221129-064421-root.json [06:44:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1138.eqiad.wmnet with reason: Maintenance [06:45:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1138.eqiad.wmnet with reason: Maintenance [06:45:07] SRE, ops-codfw, DBA: db2174 lost power - https://phabricator.wikimedia.org/T323512 (Marostegui) Open→Resolved Host being repooled automatically. Notifications enabled. 
[06:46:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1099.eqiad.wmnet with reason: Maintenance [06:47:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1099.eqiad.wmnet with reason: Maintenance [06:47:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T321126)', diff saved to https://phabricator.wikimedia.org/P41586 and previous config saved to /var/cache/conftool/dbconfig/20221129-064721-marostegui.json [06:47:28] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [06:49:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T321126)', diff saved to https://phabricator.wikimedia.org/P41587 and previous config saved to /var/cache/conftool/dbconfig/20221129-064945-marostegui.json [06:50:14] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST routes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:51:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T323907)', diff saved to https://phabricator.wikimedia.org/P41588 and previous config saved to /var/cache/conftool/dbconfig/20221129-065147-ladsgroup.json [06:51:57] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [06:57:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance [06:57:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance [06:57:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T323907)', diff saved to https://phabricator.wikimedia.org/P41589 and previous config saved to /var/cache/conftool/dbconfig/20221129-065741-ladsgroup.json 
[06:57:47] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [06:59:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2174 (re)pooling @ 25%: After HW maintenance', diff saved to https://phabricator.wikimedia.org/P41590 and previous config saved to /var/cache/conftool/dbconfig/20221129-065926-root.json [07:00:04] kormat, marostegui, and Amir1: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221129T0700). [07:00:07] !log Starting s3 eqiad failover from db1123 to db1157 - T323546 [07:00:10] let's go [07:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:13] T323546: Switchover s3 master (db1123 -> db1157) - https://phabricator.wikimedia.org/T323546 [07:00:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s3 eqiad as read-only for maintenance - T323546', diff saved to https://phabricator.wikimedia.org/P41591 and previous config saved to /var/cache/conftool/dbconfig/20221129-070032-ladsgroup.json [07:01:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1157 to s3 primary and set section read-write T323546', diff saved to https://phabricator.wikimedia.org/P41592 and previous config saved to /var/cache/conftool/dbconfig/20221129-070102-ladsgroup.json [07:04:06] RECOVERY - SSH on mw1330.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:04:36] (PS2) Ladsgroup: wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/858381 (https://phabricator.wikimedia.org/T323546) (owner: Gerrit maintenance bot) [07:04:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P41593 and previous config saved to /var/cache/conftool/dbconfig/20221129-070451-marostegui.json [07:05:19] (CR) Ladsgroup: [V: +2 C: +2] 
wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/858381 (https://phabricator.wikimedia.org/T323546) (owner: Gerrit maintenance bot) [07:06:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1123 T323546', diff saved to https://phabricator.wikimedia.org/P41594 and previous config saved to /var/cache/conftool/dbconfig/20221129-070637-ladsgroup.json [07:06:45] T323546: Switchover s3 master (db1123 -> db1157) - https://phabricator.wikimedia.org/T323546 [07:06:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P41595 and previous config saved to /var/cache/conftool/dbconfig/20221129-070653-ladsgroup.json [07:08:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance [07:08:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance [07:13:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T323907)', diff saved to https://phabricator.wikimedia.org/P41596 and previous config saved to /var/cache/conftool/dbconfig/20221129-071334-ladsgroup.json [07:13:42] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [07:14:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1123.eqiad.wmnet with reason: Maintenance [07:14:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1123.eqiad.wmnet with reason: Maintenance [07:14:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2174 (re)pooling @ 50%: After HW maintenance', diff saved to https://phabricator.wikimedia.org/P41597 and previous config saved to /var/cache/conftool/dbconfig/20221129-071431-root.json [07:16:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1123.eqiad.wmnet 
with reason: Maintenance [07:16:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1123.eqiad.wmnet with reason: Maintenance [07:19:02] SRE, ops-eqiad, DC-Ops, serviceops, Patch-For-Review: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (Joe) [07:19:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P41598 and previous config saved to /var/cache/conftool/dbconfig/20221129-071958-marostegui.json [07:20:43] (PS4) Giuseppe Lavagetto: site: assign new appservers to their roles [puppet] - https://gerrit.wikimedia.org/r/859964 (https://phabricator.wikimedia.org/T313327) [07:22:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P41599 and previous config saved to /var/cache/conftool/dbconfig/20221129-072159-ladsgroup.json [07:23:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance [07:23:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2098.codfw.wmnet with reason: Maintenance [07:25:32] (CR) Giuseppe Lavagetto: [C: +2] site: assign new appservers to their roles (1 comment) [puppet] - https://gerrit.wikimedia.org/r/859964 (https://phabricator.wikimedia.org/T313327) (owner: Giuseppe Lavagetto) [07:26:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance [07:26:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance [07:28:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P41600 and previous config saved to 
/var/cache/conftool/dbconfig/20221129-072841-ladsgroup.json [07:29:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2174 (re)pooling @ 75%: After HW maintenance', diff saved to https://phabricator.wikimedia.org/P41601 and previous config saved to /var/cache/conftool/dbconfig/20221129-072936-root.json [07:30:43] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 199 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:31:43] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 3 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:35:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T321126)', diff saved to https://phabricator.wikimedia.org/P41602 and previous config saved to /var/cache/conftool/dbconfig/20221129-073504-marostegui.json [07:35:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1105.eqiad.wmnet with reason: Maintenance [07:35:11] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [07:35:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1105.eqiad.wmnet with reason: Maintenance [07:35:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T321126)', diff saved to https://phabricator.wikimedia.org/P41603 and previous config saved to /var/cache/conftool/dbconfig/20221129-073525-marostegui.json [07:37:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T323907)', diff saved to https://phabricator.wikimedia.org/P41604 and 
previous config saved to /var/cache/conftool/dbconfig/20221129-073706-ladsgroup.json [07:37:13] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [07:37:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T321126)', diff saved to https://phabricator.wikimedia.org/P41605 and previous config saved to /var/cache/conftool/dbconfig/20221129-073748-marostegui.json [07:39:34] (PS8) Slyngshede: C:ldap::client::utils Rewrite add-ldap-group [puppet] - https://gerrit.wikimedia.org/r/860568 [07:42:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [07:42:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [07:42:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2105 (T322618)', diff saved to https://phabricator.wikimedia.org/P41606 and previous config saved to /var/cache/conftool/dbconfig/20221129-074229-ladsgroup.json [07:42:36] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [07:43:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P41607 and previous config saved to /var/cache/conftool/dbconfig/20221129-074347-ladsgroup.json [07:44:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2174 (re)pooling @ 100%: After HW maintenance', diff saved to https://phabricator.wikimedia.org/P41608 and previous config saved to /var/cache/conftool/dbconfig/20221129-074441-root.json [07:49:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T322618)', diff saved to https://phabricator.wikimedia.org/P41609 and previous config saved to /var/cache/conftool/dbconfig/20221129-074951-ladsgroup.json [07:49:59] T322618: Fix renamed indexes of 
flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [07:52:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P41610 and previous config saved to /var/cache/conftool/dbconfig/20221129-075254-marostegui.json [07:55:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2100.codfw.wmnet with reason: Maintenance [07:55:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2100.codfw.wmnet with reason: Maintenance [07:58:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T323907)', diff saved to https://phabricator.wikimedia.org/P41611 and previous config saved to /var/cache/conftool/dbconfig/20221129-075854-ladsgroup.json [07:58:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance [07:59:01] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [07:59:24] PROBLEM - Check systemd state on mw1457 is CRITICAL: CRITICAL - degraded: The following units failed: nutcracker.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:59:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance [07:59:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T323907)', diff saved to https://phabricator.wikimedia.org/P41612 and previous config saved to /var/cache/conftool/dbconfig/20221129-075937-ladsgroup.json [08:00:05] Amir1 and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221129T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. 
[08:00:32] !log oblivian@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on 42 hosts with reason: Appservers [08:03:25] !log oblivian@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 42 hosts with reason: Appservers [08:04:55] (PS9) Slyngshede: C:ldap::client::utils Rewrite add-ldap-group [puppet] - https://gerrit.wikimedia.org/r/860568 [08:04:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P41613 and previous config saved to /var/cache/conftool/dbconfig/20221129-080458-ladsgroup.json [08:05:39] (CR) CI reject: [V: -1] C:ldap::client::utils Rewrite add-ldap-group [puppet] - https://gerrit.wikimedia.org/r/860568 (owner: Slyngshede) [08:08:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P41614 and previous config saved to /var/cache/conftool/dbconfig/20221129-080801-marostegui.json [08:08:36] (PS10) Slyngshede: C:ldap::client::utils Rewrite add-ldap-group [puppet] - https://gerrit.wikimedia.org/r/860568 [08:10:45] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T323970 (phaultfinder) [08:10:57] (CR) CI reject: [V: -1] C:ldap::client::utils Rewrite add-ldap-group [puppet] - https://gerrit.wikimedia.org/r/860568 (owner: Slyngshede) [08:11:14] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:04] !log rebalance Ganeti group D/codfw following reboots [08:13:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:13:19] !log oblivian@cumin1001 START - Cookbook sre.hosts.reboot-single for host mw1457.eqiad.wmnet [08:14:17] (CR) Muehlenhoff: [C: +2] Retire obsolete cloudvirt Partman recipes 
[puppet] - 10https://gerrit.wikimedia.org/r/859431 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [08:14:40] (03PS11) 10Slyngshede: C:ldap::client::utils Rewrite add-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/860568 [08:15:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T323907)', diff saved to https://phabricator.wikimedia.org/P41618 and previous config saved to /var/cache/conftool/dbconfig/20221129-081504-ladsgroup.json [08:15:11] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [08:18:06] RECOVERY - Check systemd state on mw1457 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:20:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P41619 and previous config saved to /var/cache/conftool/dbconfig/20221129-082004-ladsgroup.json [08:20:44] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T323970 (10phaultfinder) [08:22:51] (03PS1) 10Majavah: base: fix puppet_alert.py [puppet] - 10https://gerrit.wikimedia.org/r/861805 [08:23:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T321126)', diff saved to https://phabricator.wikimedia.org/P41620 and previous config saved to /var/cache/conftool/dbconfig/20221129-082307-marostegui.json [08:23:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1106.eqiad.wmnet with reason: Maintenance [08:23:15] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [08:23:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1106.eqiad.wmnet with reason: Maintenance [08:23:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on 
clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:23:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:23:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T321126)', diff saved to https://phabricator.wikimedia.org/P41621 and previous config saved to /var/cache/conftool/dbconfig/20221129-082335-marostegui.json [08:24:31] !log oblivian@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host mw1457.eqiad.wmnet [08:25:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T321126)', diff saved to https://phabricator.wikimedia.org/P41622 and previous config saved to /var/cache/conftool/dbconfig/20221129-082558-marostegui.json [08:27:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2108.codfw.wmnet with reason: Maintenance [08:27:22] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 200 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:27:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2108.codfw.wmnet with reason: Maintenance [08:27:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T323907)', diff saved to https://phabricator.wikimedia.org/P41623 and previous config saved to /var/cache/conftool/dbconfig/20221129-082740-ladsgroup.json [08:27:46] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [08:30:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P41624 and previous config saved to 
/var/cache/conftool/dbconfig/20221129-083010-ladsgroup.json [08:32:13] (03CR) 10David Caro: [C: 03+2] base: fix puppet_alert.py [puppet] - 10https://gerrit.wikimedia.org/r/861805 (owner: 10Majavah) [08:35:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T322618)', diff saved to https://phabricator.wikimedia.org/P41625 and previous config saved to /var/cache/conftool/dbconfig/20221129-083511-ladsgroup.json [08:35:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [08:35:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2109.codfw.wmnet with reason: Maintenance [08:35:19] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [08:35:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T322618)', diff saved to https://phabricator.wikimedia.org/P41626 and previous config saved to /var/cache/conftool/dbconfig/20221129-083521-ladsgroup.json [08:37:41] gah, sorry about that sukhe herron ! thank you for taking care of it [08:40:50] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:41:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P41627 and previous config saved to /var/cache/conftool/dbconfig/20221129-084104-marostegui.json [08:41:35] (03PS12) 10Slyngshede: C:ldap::client::utils Rewrite add-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/860568 [08:43:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T322618)', diff saved to https://phabricator.wikimedia.org/P41628 and previous config saved to /var/cache/conftool/dbconfig/20221129-084302-ladsgroup.json [08:43:10] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [08:45:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P41629 and previous config saved to /var/cache/conftool/dbconfig/20221129-084517-ladsgroup.json [08:47:00] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: pool graphite1005 for reads [puppet] - 10https://gerrit.wikimedia.org/r/860522 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [08:52:54] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:55:05] (03PS1) 10Effie Mouzeli: mediawiki: remove nutcracker from application servers [puppet] - 10https://gerrit.wikimedia.org/r/861806 (https://phabricator.wikimedia.org/T277183) [08:55:56] (03PS2) 10Muehlenhoff: puppetmaster: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/860909 (https://phabricator.wikimedia.org/T308013) [08:56:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1106', diff saved to https://phabricator.wikimedia.org/P41630 and previous config saved to /var/cache/conftool/dbconfig/20221129-085611-marostegui.json [08:58:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P41631 and previous config saved to /var/cache/conftool/dbconfig/20221129-085809-ladsgroup.json [08:58:20] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:59:57] (03CR) 10Slyngshede: "Minor comment regarding running the Docker image on something like the M1 or M2." [puppet] - 10https://gerrit.wikimedia.org/r/860874 (owner: 10Jbond) [09:00:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T323907)', diff saved to https://phabricator.wikimedia.org/P41632 and previous config saved to /var/cache/conftool/dbconfig/20221129-090023-ladsgroup.json [09:00:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance [09:00:31] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [09:00:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance [09:00:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T323907)', diff saved to https://phabricator.wikimedia.org/P41633 and previous config saved to /var/cache/conftool/dbconfig/20221129-090044-ladsgroup.json [09:02:17] (03PS1) 10Majavah: Remove nutcracker from cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/861807 (https://phabricator.wikimedia.org/T277183) [09:02:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T323907)', diff saved to https://phabricator.wikimedia.org/P41634 and 
previous config saved to /var/cache/conftool/dbconfig/20221129-090237-ladsgroup.json [09:02:38] (03CR) 10Muehlenhoff: [C: 03+2] puppetmaster: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/860909 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:03:51] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [09:03:53] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38465/console" [puppet] - 10https://gerrit.wikimedia.org/r/861807 (https://phabricator.wikimedia.org/T277183) (owner: 10Majavah) [09:04:43] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [09:04:53] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [09:05:18] (03PS2) 10Effie Mouzeli: mediawiki: remove nutcracker from application servers [puppet] - 10https://gerrit.wikimedia.org/r/861806 (https://phabricator.wikimedia.org/T277183) [09:11:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T321126)', diff saved to https://phabricator.wikimedia.org/P41635 and previous config saved to /var/cache/conftool/dbconfig/20221129-091117-marostegui.json [09:11:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1107.eqiad.wmnet with reason: Maintenance [09:11:25] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [09:11:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1107.eqiad.wmnet with reason: Maintenance [09:11:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1107 (T321126)', diff saved to https://phabricator.wikimedia.org/P41636 and previous config saved to /var/cache/conftool/dbconfig/20221129-091149-marostegui.json [09:12:02] (03CR) 10Effie Mouzeli: "PCC OK https://puppet-compiler.wmflabs.org/output/861806/38464/" [puppet] - 10https://gerrit.wikimedia.org/r/861806 (https://phabricator.wikimedia.org/T277183) (owner: 10Effie Mouzeli) [09:12:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2145', diff saved to https://phabricator.wikimedia.org/P41637 and previous config saved to /var/cache/conftool/dbconfig/20221129-091224-marostegui.json [09:13:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P41638 and previous config saved to 
/var/cache/conftool/dbconfig/20221129-091315-ladsgroup.json [09:14:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T321126)', diff saved to https://phabricator.wikimedia.org/P41639 and previous config saved to /var/cache/conftool/dbconfig/20221129-091412-marostegui.json [09:14:20] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38468/console" [puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede) [09:15:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2145 (re)pooling @ 10%: After HW maintenance', diff saved to https://phabricator.wikimedia.org/P41640 and previous config saved to /var/cache/conftool/dbconfig/20221129-091534-root.json [09:17:03] (03PS1) 10Marostegui: control-mariadb-10.4-bullseye: Downgrade to 10.4.26 [software] - 10https://gerrit.wikimedia.org/r/861808 (https://phabricator.wikimedia.org/T323928) [09:17:13] !log update component/puppetdb7 to puppetdb 7.11.2-3 (fixing Postgres 15 compat) T321783 [09:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:20] T321783: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 [09:17:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T323907)', diff saved to https://phabricator.wikimedia.org/P41641 and previous config saved to /var/cache/conftool/dbconfig/20221129-091732-ladsgroup.json [09:17:39] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [09:17:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P41642 and previous config saved to /var/cache/conftool/dbconfig/20221129-091744-ladsgroup.json [09:17:55] (03CR) 10Elukey: "Left some minor comments to understand :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 
(https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [09:18:17] (03CR) 10Marostegui: [C: 03+2] control-mariadb-10.4-bullseye: Downgrade to 10.4.26 [software] - 10https://gerrit.wikimedia.org/r/861808 (https://phabricator.wikimedia.org/T323928) (owner: 10Marostegui) [09:19:34] (03Merged) 10jenkins-bot: control-mariadb-10.4-bullseye: Downgrade to 10.4.26 [software] - 10https://gerrit.wikimedia.org/r/861808 (https://phabricator.wikimedia.org/T323928) (owner: 10Marostegui) [09:20:11] PROBLEM - mediawiki-installation DSH group on mw1457 is CRITICAL: Host mw1457 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [09:21:29] (03PS2) 10Muehlenhoff: rsyslog: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/860904 (https://phabricator.wikimedia.org/T308013) [09:27:12] (03PS2) 10Arturo Borrero Gonzalez: wmcs: libs: openstack: replace host_list() with hypervisor_list() [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/861432 [09:28:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T322618)', diff saved to https://phabricator.wikimedia.org/P41643 and previous config saved to /var/cache/conftool/dbconfig/20221129-092822-ladsgroup.json [09:28:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [09:28:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2139.codfw.wmnet with reason: Maintenance [09:28:29] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [09:29:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P41644 and previous config saved to /var/cache/conftool/dbconfig/20221129-092918-marostegui.json [09:29:32] (03CR) 10Slyngshede: [V: 03+1] C:ldap::client::utils Rewrite add-ldap-group 
(7 comments) [puppet] - https://gerrit.wikimedia.org/r/860568 (owner: Slyngshede) [09:30:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2145 (re)pooling @ 25%: After HW maintenance', diff saved to https://phabricator.wikimedia.org/P41645 and previous config saved to /var/cache/conftool/dbconfig/20221129-093039-root.json [09:30:40] (CR) CI reject: [V: -1] wmcs: libs: openstack: replace host_list() with hypervisor_list() [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/861432 (owner: Arturo Borrero Gonzalez) [09:32:21] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:32:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P41646 and previous config saved to /var/cache/conftool/dbconfig/20221129-093239-ladsgroup.json [09:32:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P41647 and previous config saved to /var/cache/conftool/dbconfig/20221129-093250-ladsgroup.json [09:33:02] (CR) David Caro: [C: +1] wmcs: libs: openstack: replace host_list() with hypervisor_list() (1 comment) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/861432 (owner: Arturo Borrero Gonzalez) [09:33:06] (CR) Muehlenhoff: C:ldap::client::utils Rewrite add-ldap-group (3 comments) [puppet] - https://gerrit.wikimedia.org/r/860568 (owner: Slyngshede) [09:33:56] (PS3) Arturo Borrero Gonzalez: wmcs: libs: openstack: replace host_list() with hypervisor_list() [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/861432 [09:34:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [09:34:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db2149.codfw.wmnet with reason: Maintenance [09:34:15] (CR) Arturo Borrero Gonzalez: wmcs: libs: openstack: replace host_list() with hypervisor_list() (2 comments) [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/861432 (owner: Arturo Borrero Gonzalez) [09:34:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T322618)', diff saved to https://phabricator.wikimedia.org/P41648 and previous config saved to /var/cache/conftool/dbconfig/20221129-093420-ladsgroup.json [09:34:28] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [09:35:49] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 400 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:37:18] (CR) Arturo Borrero Gonzalez: [C: +2] wmcs: libs: openstack: replace host_list() with hypervisor_list() [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/861432 (owner: Arturo Borrero Gonzalez) [09:38:00] (PS2) Arturo Borrero Gonzalez: wmcs: openstack: lib: ensure_canary: fix changelist calculation [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/861438 [09:38:11] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:42:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T322618)', diff saved to https://phabricator.wikimedia.org/P41649 and previous config saved to /var/cache/conftool/dbconfig/20221129-094212-ladsgroup.json [09:42:19] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [09:44:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P41650 and previous config saved to /var/cache/conftool/dbconfig/20221129-094424-marostegui.json [09:45:37] (03CR) 10Muehlenhoff: [C: 03+2] rsyslog: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/860904 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [09:45:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2145 (re)pooling @ 50%: After HW maintenance', diff saved to https://phabricator.wikimedia.org/P41651 and previous config saved to /var/cache/conftool/dbconfig/20221129-094544-root.json [09:46:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (10jcrespo) 10G is also not absolutely required at the moment. I personally would like to eventually have all dbs in a 10G for a fast backup recovery- and that is why we... 
[09:46:51] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:47:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P41652 and previous config saved to /var/cache/conftool/dbconfig/20221129-094745-ladsgroup.json [09:47:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T323907)', diff saved to https://phabricator.wikimedia.org/P41653 and previous config saved to /var/cache/conftool/dbconfig/20221129-094757-ladsgroup.json [09:47:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2120.codfw.wmnet with reason: Maintenance [09:48:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2120.codfw.wmnet with reason: Maintenance [09:48:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T323907)', diff saved to https://phabricator.wikimedia.org/P41654 and previous config saved to /var/cache/conftool/dbconfig/20221129-094818-ladsgroup.json [09:48:25] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [09:48:27] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:48:38] (03PS1) 10Clément Goubert: mediawiki::maintenance::campaignevents: meta [puppet] - 10https://gerrit.wikimedia.org/r/861813 (https://phabricator.wikimedia.org/T320403) [09:50:26] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38473/console" [puppet] - 10https://gerrit.wikimedia.org/r/861813 (https://phabricator.wikimedia.org/T320403) (owner: 10Clément Goubert) [09:56:37] !log installing curl security updates [09:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P41655 and previous config saved to /var/cache/conftool/dbconfig/20221129-095718-ladsgroup.json [09:59:10] (03CR) 10Majavah: [C: 03+2] "retrying" [core] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/860600 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) [09:59:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T321126)', diff saved to https://phabricator.wikimedia.org/P41656 and previous config saved to /var/cache/conftool/dbconfig/20221129-095931-marostegui.json [09:59:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1118.eqiad.wmnet with reason: Maintenance [09:59:40] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [09:59:59] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST routes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:00:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1118.eqiad.wmnet with reason: Maintenance [10:00:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1118 (T321126)', diff saved to https://phabricator.wikimedia.org/P41657 and previous config saved to /var/cache/conftool/dbconfig/20221129-100025-marostegui.json [10:00:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2145 (re)pooling @ 75%: 
After HW maintenance', diff saved to https://phabricator.wikimedia.org/P41658 and previous config saved to /var/cache/conftool/dbconfig/20221129-100049-root.json [10:02:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T321126)', diff saved to https://phabricator.wikimedia.org/P41659 and previous config saved to /var/cache/conftool/dbconfig/20221129-100248-marostegui.json [10:02:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T323907)', diff saved to https://phabricator.wikimedia.org/P41660 and previous config saved to /var/cache/conftool/dbconfig/20221129-100258-ladsgroup.json [10:03:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance [10:03:04] (CR) Vgutierrez: [C: -1] P:cache::haproxy: harden systemd unit (3 comments) [puppet] - https://gerrit.wikimedia.org/r/861445 (https://phabricator.wikimedia.org/T323944) (owner: Ssingh) [10:03:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance [10:03:19] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [10:03:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T323907)', diff saved to https://phabricator.wikimedia.org/P41661 and previous config saved to /var/cache/conftool/dbconfig/20221129-100319-ladsgroup.json [10:04:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST routes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:05:45] ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T323970 (phaultfinder) [10:07:23] !log add temporary grants to scholarships for backups on db1117, db2160 T243037 [10:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:31] 
T243037: Shutdown scholarships.wikimedia.org and archive project - https://phabricator.wikimedia.org/T243037 [10:07:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:10:44] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T323970 (10phaultfinder) [10:10:49] (03PS2) 10Btullis: Promote the aqs_next role to be aqs [puppet] - 10https://gerrit.wikimedia.org/r/859059 (https://phabricator.wikimedia.org/T302278) [10:11:06] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P41662 and previous config saved to /var/cache/conftool/dbconfig/20221129-101225-ladsgroup.json [10:12:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:13:14] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/1.40.0-wmf.12 [core] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/860600 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) [10:15:49] !log upgrading puppetdb2003 to bookworm T321783 [10:15:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2145 (re)pooling @ 100%: After HW maintenance', diff saved to https://phabricator.wikimedia.org/P41663 and previous config saved to 
/var/cache/conftool/dbconfig/20221129-101554-root.json [10:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:57] T321783: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 [10:16:46] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38474/console" [puppet] - 10https://gerrit.wikimedia.org/r/859059 (https://phabricator.wikimedia.org/T302278) (owner: 10Btullis) [10:17:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P41664 and previous config saved to /var/cache/conftool/dbconfig/20221129-101754-marostegui.json [10:19:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T323907)', diff saved to https://phabricator.wikimedia.org/P41665 and previous config saved to /var/cache/conftool/dbconfig/20221129-101958-ladsgroup.json [10:20:05] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [10:22:17] (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [10:23:13] uh? 
[10:23:27] (03CR) 10Hnowlan: [C: 03+1] Update partman config for maps [puppet] - 10https://gerrit.wikimedia.org/r/861405 (owner: 10Muehlenhoff) [10:23:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T323907)', diff saved to https://phabricator.wikimedia.org/P41666 and previous config saved to /var/cache/conftool/dbconfig/20221129-102345-ladsgroup.json [10:24:45] there was a spike of logs but it was some minutes ago [10:24:45] (03PS2) 10Stevemunene: Add an-presto1006 to presto cluster [puppet] - 10https://gerrit.wikimedia.org/r/861368 (https://phabricator.wikimedia.org/T323783) [10:24:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST routes) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:26:09] !log restart kube-apiserver on ml-serve-ctrl* to clear out some knative controller issue [10:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:07] looking at general graphs I don't see anything unusual [10:27:17] (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [10:27:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T322618)', diff saved to https://phabricator.wikimedia.org/P41667 and previous config saved to /var/cache/conftool/dbconfig/20221129-102731-ladsgroup.json [10:27:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2156.codfw.wmnet with reason: Maintenance [10:27:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2156.codfw.wmnet with reason: 
Maintenance [10:27:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [10:27:39] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [10:27:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2094.codfw.wmnet with reason: Maintenance [10:27:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T322618)', diff saved to https://phabricator.wikimedia.org/P41668 and previous config saved to /var/cache/conftool/dbconfig/20221129-102746-ladsgroup.json [10:28:08] (03PS1) 10Sergio Gimeno: ImageRecommendation: End experiment [extensions/GrowthExperiments] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/861815 (https://phabricator.wikimedia.org/T323686) [10:28:14] (03PS1) 10Sergio Gimeno: NewImpact: Prepare experiment [extensions/GrowthExperiments] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/861816 (https://phabricator.wikimedia.org/T323526) [10:28:42] could it be some glitch on load balancer or monitoring? [10:28:56] PROBLEM - SSH on mw1312.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:29:22] (03PS4) 10Slyngshede: ldap:management rewrite modify-mfa to use Bitu. 
[puppet] - 10https://gerrit.wikimedia.org/r/861385 [10:29:58] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:30:44] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T323970 (10phaultfinder) [10:30:53] !log revoke temporary grants to scholarships for backups on db1117, db2160 T243037 [10:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:00] T243037: Shutdown scholarships.wikimedia.org and archive project - https://phabricator.wikimedia.org/T243037 [10:31:03] (03CR) 10Slyngshede: ldap:management rewrite modify-mfa to use Bitu. (0314 comments) [puppet] - 10https://gerrit.wikimedia.org/r/861385 (owner: 10Slyngshede) [10:31:46] jynus: it isn't traffic related, "Prometheus in eqiad is unable to scrape metrics for 80.77% of cluster api_appserver." [10:31:58] jynus: I'd say that the alert summary is quite misleading [10:32:22] indeed the new prometheus-based alerts may need a review [10:32:27] on the text [10:33:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P41669 and previous config saved to /var/cache/conftool/dbconfig/20221129-103301-marostegui.json [10:33:03] (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [10:33:41] (03PS1) 10Sergio Gimeno: refreshUserImpactData.php: Add force and dry-run flags [extensions/GrowthExperiments] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/861817 (https://phabricator.wikimedia.org/T322541) [10:34:47] (03PS1) 
10Sergio Gimeno: refreshUserImpactData.php: Add minimum edit filter [extensions/GrowthExperiments] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/861818 (https://phabricator.wikimedia.org/T323958) [10:34:51] _joe_: could this be related to you adding new appservers (AppserversUnreachable) ^ [10:34:58] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:35:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P41670 and previous config saved to /var/cache/conftool/dbconfig/20221129-103505-ladsgroup.json [10:35:08] <_joe_> jayme: what? [10:35:16] firing: Appserver unavailable for cluster api_appserver at eqiad [10:35:23] <_joe_> jayme: no I doubt it's related [10:35:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T322618)', diff saved to https://phabricator.wikimedia.org/P41671 and previous config saved to /var/cache/conftool/dbconfig/20221129-103524-ladsgroup.json [10:35:30] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [10:35:31] <_joe_> if it's 80% of servers [10:35:52] <_joe_> but these servers are not in LVS [10:35:56] <_joe_> and being installed right now [10:36:01] <_joe_> so yeah, possible [10:36:34] not sure where the number comes from and did not check the actual query... I was just wondering because of the time correlation [10:36:46] <_joe_> this is the problem of using puppetdb in such moments [10:36:58] <_joe_> anyways, I don't think there's a real issue [10:37:12] <_joe_> also those servers now should correctly reply to probes [10:37:45] the comment was about making the alert more precise [10:37:48] (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at eqiad - 
https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [10:38:09] e.g. "X percent of app servers cannot be accessed from Y" [10:38:43] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/854574 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:38:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P41672 and previous config saved to /var/cache/conftool/dbconfig/20221129-103852-ladsgroup.json [10:39:05] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 104 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:39:39] (03CR) 10Jbond: [C: 03+1] analytics::refinery: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/858604 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [10:40:06] (03CR) 10Muehlenhoff: "Looks good, two nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/861385 (owner: 10Slyngshede) [10:40:19] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:42:01] (03PS1) 10Jgiannelos: restbase-beta: Change wikifeeds URI for deployment prep [puppet] - 10https://gerrit.wikimedia.org/r/861821 (https://phabricator.wikimedia.org/T306068) [10:44:09] (03CR) 10CI reject: [V: 04-1] ImageRecommendation: End experiment [extensions/GrowthExperiments] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/861815 (https://phabricator.wikimedia.org/T323686) (owner: 10Sergio Gimeno) [10:45:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] conftool: add the new servers [puppet] - 10https://gerrit.wikimedia.org/r/859965 (https://phabricator.wikimedia.org/T313327) (owner: 10Giuseppe Lavagetto) [10:45:36] (03PS4) 10Giuseppe Lavagetto: conftool: add the new servers [puppet] - 10https://gerrit.wikimedia.org/r/859965 (https://phabricator.wikimedia.org/T313327) [10:46:10] (03CR) 10Jbond: [C: 03+1] "LGTM barring the last few comments" [puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede) [10:48:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T321126)', diff saved to https://phabricator.wikimedia.org/P41673 and previous config saved to /var/cache/conftool/dbconfig/20221129-104807-marostegui.json [10:48:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1119.eqiad.wmnet with reason: Maintenance [10:48:14] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [10:48:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1119.eqiad.wmnet with reason: Maintenance [10:48:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T321126)', diff saved to https://phabricator.wikimedia.org/P41674 and previous config saved to 
/var/cache/conftool/dbconfig/20221129-104828-marostegui.json [10:48:35] !log stopping puppet on maps* for Cassandra removal [10:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:41] !log oblivian@puppetmaster1001 conftool action : set/weight=30; selector: cluster=appserver,dc=eqiad,name=mw14[7-9].* [10:49:38] (03CR) 10CI reject: [V: 04-1] refreshUserImpactData.php: Add force and dry-run flags [extensions/GrowthExperiments] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/861817 (https://phabricator.wikimedia.org/T322541) (owner: 10Sergio Gimeno) [10:49:40] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] maps: remove Cassandra and Tilerator service [puppet] - 10https://gerrit.wikimedia.org/r/860634 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [10:49:59] !log oblivian@puppetmaster1001 conftool action : set/weight=30; selector: cluster=api_appserver,dc=eqiad,name=mw14[6-9].* [10:50:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P41675 and previous config saved to /var/cache/conftool/dbconfig/20221129-105011-ladsgroup.json [10:50:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P41676 and previous config saved to /var/cache/conftool/dbconfig/20221129-105030-ladsgroup.json [10:50:45] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T323970 (10phaultfinder) [10:50:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T321126)', diff saved to https://phabricator.wikimedia.org/P41677 and previous config saved to /var/cache/conftool/dbconfig/20221129-105050-marostegui.json [10:52:31] <_joe_> jouncebot: now_and_next [10:52:39] <_joe_> jouncebot: now [10:52:39] No deployments scheduled for the next 3 hour(s) and 7 minute(s) [10:52:45] <_joe_> jouncebot: next [10:52:45] In 3 hour(s) and 7 
minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221129T1400) [10:52:45] In 3 hour(s) and 7 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221129T1400) [10:52:52] <_joe_> okk [10:53:41] (03PS5) 10Slyngshede: ldap:management rewrite modify-mfa to use Bitu. [puppet] - 10https://gerrit.wikimedia.org/r/861385 [10:53:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P41678 and previous config saved to /var/cache/conftool/dbconfig/20221129-105358-ladsgroup.json [10:55:45] <_joe_> !log new appservers are in rotation T313327 [10:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:51] T313327: Put mw14[57-98] in production - https://phabricator.wikimedia.org/T313327 [10:55:52] (03CR) 10CI reject: [V: 04-1] ldap:management rewrite modify-mfa to use Bitu. [puppet] - 10https://gerrit.wikimedia.org/r/861385 (owner: 10Slyngshede) [10:56:31] (03PS6) 10Slyngshede: ldap:management rewrite modify-mfa to use Bitu. [puppet] - 10https://gerrit.wikimedia.org/r/861385 [10:57:16] (03CR) 10Jbond: ldap:management rewrite modify-mfa to use Bitu. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861385 (owner: 10Slyngshede) [10:57:50] (03CR) 10Jbond: [C: 03+1] ldap:management rewrite modify-mfa to use Bitu. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861385 (owner: 10Slyngshede) [10:58:45] !log oblivian@puppetmaster1001 conftool action : set/weight=10; selector: cluster=(jobrunner|videoscaler),dc=eqiad,name=mw14[5-9].* [10:58:56] (03PS1) 10Muehlenhoff: puppetdb: Bump postgres version for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/861823 [10:59:51] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:00:51] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:03:39] (03PS2) 10Jgiannelos: restbase-beta: Change wikifeeds URI for deployment prep [puppet] - 10https://gerrit.wikimedia.org/r/861821 (https://phabricator.wikimedia.org/T306068) [11:03:52] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/861385 (owner: 10Slyngshede) [11:04:07] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10Platform Team Workboards (Platform Engineering Reliability): 3d2png failing in Kubernetes - https://phabricator.wikimedia.org/T323936 (10JMeybohm) > However, according to graphs memory usage is looking pretty meager - something is missing. Keep in m... 
[11:04:17] (03CR) 10CI reject: [V: 04-1] restbase-beta: Change wikifeeds URI for deployment prep [puppet] - 10https://gerrit.wikimedia.org/r/861821 (https://phabricator.wikimedia.org/T306068) (owner: 10Jgiannelos) [11:04:27] PROBLEM - mediawiki-installation DSH group on mw1494 is CRITICAL: Host mw1494 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [11:05:11] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:05:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T323907)', diff saved to https://phabricator.wikimedia.org/P41680 and previous config saved to /var/cache/conftool/dbconfig/20221129-110518-ladsgroup.json [11:05:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance [11:05:26] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [11:05:27] (03PS3) 10Jgiannelos: restbase-beta: Change wikifeeds URI for deployment prep [puppet] - 10https://gerrit.wikimedia.org/r/861821 (https://phabricator.wikimedia.org/T306068) [11:05:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10EChetty) [11:05:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance [11:05:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P41681 and previous config saved to /var/cache/conftool/dbconfig/20221129-110537-ladsgroup.json [11:05:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T323907)', diff saved to 
https://phabricator.wikimedia.org/P41682 and previous config saved to /var/cache/conftool/dbconfig/20221129-110546-ladsgroup.json [11:06:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P41683 and previous config saved to /var/cache/conftool/dbconfig/20221129-110559-marostegui.json [11:06:09] PROBLEM - tileratorui on maps2005 is CRITICAL: connect to address 10.192.0.155 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [11:06:13] PROBLEM - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The following units failed: tileratorui.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:17] PROBLEM - tileratorui on maps2010 is CRITICAL: connect to address 10.192.48.166 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [11:06:17] PROBLEM - tilerator on maps2008 is CRITICAL: connect to address 10.192.48.165 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [11:06:27] PROBLEM - Check systemd state on maps2005 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service,tileratorui.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:30] (03CR) 10Vgutierrez: [C: 04-1] restbase-beta: Change wikifeeds URI for deployment prep (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861821 (https://phabricator.wikimedia.org/T306068) (owner: 10Jgiannelos) [11:06:31] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: tileratorui.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:31] PROBLEM - Check systemd state on maps2006 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service,tileratorui.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:41] PROBLEM - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The following units failed: tileratorui.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:06:43] PROBLEM - tileratorui on maps1006 is CRITICAL: connect to address 10.64.0.18 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [11:06:49] This was me ^ [11:06:51] PROBLEM - tileratorui on maps2009 is CRITICAL: connect to address 10.192.16.107 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [11:09:01] PROBLEM - tilerator on maps2005 is CRITICAL: connect to address 10.192.0.155 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [11:09:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T323907)', diff saved to https://phabricator.wikimedia.org/P41684 and previous config saved to /var/cache/conftool/dbconfig/20221129-110905-ladsgroup.json [11:09:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance [11:09:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2121.codfw.wmnet with reason: Maintenance [11:09:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T323907)', diff saved to https://phabricator.wikimedia.org/P41685 and previous config saved to /var/cache/conftool/dbconfig/20221129-110926-ladsgroup.json [11:09:33] (03CR) 10Vgutierrez: thanos: add thanos-web to catalog and frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861412 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [11:10:21] !log oblivian@cumin1001 START - Cookbook sre.hosts.remove-downtime for 42 hosts [11:10:34] !log oblivian@cumin1001 END (PASS) - Cookbook 
sre.hosts.remove-downtime (exit_code=0) for 42 hosts [11:10:45] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T323970 (10phaultfinder) [11:11:32] (03CR) 10Muehlenhoff: [C: 03+2] Update partman config for maps [puppet] - 10https://gerrit.wikimedia.org/r/861405 (owner: 10Muehlenhoff) [11:12:01] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38477/console" [puppet] - 10https://gerrit.wikimedia.org/r/861368 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [11:12:37] RECOVERY - Check systemd state on maps1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:55] <_joe_> sigh, you might get some mediawiki dsh group alert [11:12:56] (03PS3) 10Arturo Borrero Gonzalez: P:openstack: explicit rules for haproxy backend traffic POC [puppet] - 10https://gerrit.wikimedia.org/r/854875 (owner: 10Majavah) [11:13:01] RECOVERY - Check systemd state on maps1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:06] <_joe_> they're all just due to icinga being slow [11:13:21] RECOVERY - Check systemd state on maps2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:25] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:25] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:25] RECOVERY - Check systemd state on maps2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:13:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] P:openstack: 
explicit rules for haproxy backend traffic POC (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854875 (owner: 10Majavah) [11:13:52] (03PS1) 10Filippo Giunchedi: prometheus: move traffic rules off 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/861825 (https://phabricator.wikimedia.org/T288196) [11:13:55] PROBLEM - tilerator on maps2007 is CRITICAL: connect to address 10.192.32.46 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [11:14:14] (03CR) 10Vgutierrez: [C: 03+1] Add thanos-web.svc and discovery [dns] - 10https://gerrit.wikimedia.org/r/861396 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [11:14:40] (03CR) 10Filippo Giunchedi: "Essentially this drops the "site_" and "global_" prefixes, and removes some obsolete rules" [puppet] - 10https://gerrit.wikimedia.org/r/861825 (https://phabricator.wikimedia.org/T288196) (owner: 10Filippo Giunchedi) [11:15:45] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T323970 (10phaultfinder) [11:15:47] PROBLEM - tileratorui on maps2007 is CRITICAL: connect to address 10.192.32.46 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [11:16:05] RECOVERY - mediawiki-installation DSH group on mw1457 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [11:16:44] (03CR) 10Vgutierrez: [C: 03+1] conftool: add thanos-web service [puppet] - 10https://gerrit.wikimedia.org/r/861411 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [11:18:27] (03PS2) 10Filippo Giunchedi: thanos: add thanos-web to catalog and frontend [puppet] - 10https://gerrit.wikimedia.org/r/861412 (https://phabricator.wikimedia.org/T323913) [11:18:34] (03CR) 10Filippo Giunchedi: thanos: add thanos-web to catalog and frontend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861412 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) 
[11:19:41] (03PS2) 10Filippo Giunchedi: prometheus: move traffic rules off 'global' instance [puppet] - 10https://gerrit.wikimedia.org/r/861825 (https://phabricator.wikimedia.org/T288196) [11:20:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T322618)', diff saved to https://phabricator.wikimedia.org/P41686 and previous config saved to /var/cache/conftool/dbconfig/20221129-112043-ladsgroup.json [11:20:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [11:20:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2177.codfw.wmnet with reason: Maintenance [11:20:51] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [11:20:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T322618)', diff saved to https://phabricator.wikimedia.org/P41687 and previous config saved to /var/cache/conftool/dbconfig/20221129-112053-ladsgroup.json [11:21:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P41688 and previous config saved to /var/cache/conftool/dbconfig/20221129-112106-marostegui.json [11:21:22] (03CR) 10Filippo Giunchedi: [C: 03+2] Add thanos-web.svc and discovery [dns] - 10https://gerrit.wikimedia.org/r/861396 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [11:21:26] (03PS2) 10Filippo Giunchedi: Add thanos-web.svc and discovery [dns] - 10https://gerrit.wikimedia.org/r/861396 (https://phabricator.wikimedia.org/T323913) [11:22:25] (03PS1) 10Filippo Giunchedi: prometheus: deprecate traffic 'global' rules [puppet] - 10https://gerrit.wikimedia.org/r/861826 (https://phabricator.wikimedia.org/T288196) [11:22:56] (03CR) 10Filippo Giunchedi: "To be merged once the new rules have accumulated enough data" [puppet] - 
10https://gerrit.wikimedia.org/r/861826 (https://phabricator.wikimedia.org/T288196) (owner: 10Filippo Giunchedi) [11:23:09] RECOVERY - mediawiki-installation DSH group on mw1494 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [11:23:49] PROBLEM - tilerator on maps2006 is CRITICAL: connect to address 10.192.16.31 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [11:23:49] PROBLEM - tilerator on maps2010 is CRITICAL: connect to address 10.192.48.166 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [11:26:33] PROBLEM - tileratorui on maps1010 is CRITICAL: connect to address 10.64.48.6 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [11:26:33] PROBLEM - tileratorui on maps2006 is CRITICAL: connect to address 10.192.16.31 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [11:26:49] PROBLEM - mediawiki-installation DSH group on mw1307 is CRITICAL: Host mw1307 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [11:28:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T322618)', diff saved to https://phabricator.wikimedia.org/P41689 and previous config saved to /var/cache/conftool/dbconfig/20221129-112835-ladsgroup.json [11:28:42] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [11:29:38] (03PS1) 10Filippo Giunchedi: Revert thanos-web discovery record [dns] - 10https://gerrit.wikimedia.org/r/861827 (https://phabricator.wikimedia.org/T323913) [11:30:13] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:47] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert thanos-web 
discovery record [dns] - 10https://gerrit.wikimedia.org/r/861827 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [11:31:00] (03CR) 10Jgiannelos: restbase-beta: Change wikifeeds URI for deployment prep (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861821 (https://phabricator.wikimedia.org/T306068) (owner: 10Jgiannelos) [11:31:37] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 213 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:31:45] <_joe_> uhm [11:31:52] <_joe_> looking [11:32:39] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:34:31] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [11:34:48] (03CR) 10Filippo Giunchedi: [C: 03+2] conftool: add thanos-web service [puppet] - 10https://gerrit.wikimedia.org/r/861411 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [11:34:50] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [11:34:55] (03PS2) 10Filippo Giunchedi: conftool: add thanos-web service [puppet] - 10https://gerrit.wikimedia.org/r/861411 (https://phabricator.wikimedia.org/T323913) [11:34:59] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:36:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T321126)', diff saved to https://phabricator.wikimedia.org/P41690 and previous config saved to 
/var/cache/conftool/dbconfig/20221129-113612-marostegui.json [11:36:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1128.eqiad.wmnet with reason: Maintenance [11:36:20] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [11:36:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1128.eqiad.wmnet with reason: Maintenance [11:36:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T321126)', diff saved to https://phabricator.wikimedia.org/P41691 and previous config saved to /var/cache/conftool/dbconfig/20221129-113633-marostegui.json [11:37:28] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [11:37:39] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [11:37:43] !log uploaded ferm 2.5.1-1.1+wmf11u1 to apt.wikimedia.org/bookworm (rebasing our systemd startup fixes to what's in bookworm) T321783 [11:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:54] T321783: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 [11:38:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T321126)', diff saved to https://phabricator.wikimedia.org/P41692 and previous config saved to /var/cache/conftool/dbconfig/20221129-113854-marostegui.json [11:40:12] (03CR) 10Volans: [C: 03+1] "LGTM, minor nits/questions inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [11:40:44] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T323970 (10phaultfinder) [11:42:27] (03PS7) 10Slyngshede: ldap:management rewrite modify-mfa to use Bitu. 
[puppet] - 10https://gerrit.wikimedia.org/r/861385 [11:42:33] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [11:42:39] (03CR) 10Slyngshede: ldap:management rewrite modify-mfa to use Bitu. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/861385 (owner: 10Slyngshede) [11:43:00] !log +100G to global/prometheus in eqiad [11:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P41693 and previous config saved to /var/cache/conftool/dbconfig/20221129-114341-ladsgroup.json [11:44:59] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:45:35] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [11:45:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T323907)', diff saved to https://phabricator.wikimedia.org/P41694 and previous config saved to /var/cache/conftool/dbconfig/20221129-114553-ladsgroup.json [11:46:01] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [11:46:33] (03CR) 10Btullis: Add a spark-operator chart and helmfile configuration (0311 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [11:46:36] (03PS4) 10Jgiannelos: restbase-beta: Change wikifeeds URI for deployment prep [puppet] - 10https://gerrit.wikimedia.org/r/861821 (https://phabricator.wikimedia.org/T306068) [11:46:47] (03PS1) 10Muehlenhoff: Remove one more obsolete package after bullseye->bookworm upgrade [puppet] - 10https://gerrit.wikimedia.org/r/861829 [11:47:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetdb2003.codfw.wmnet [11:47:36] (03PS13) 10Slyngshede: C:ldap::client::utils Rewrite add-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/860568 [11:47:39] !log Drop scholarships database from m2 T243037 [11:47:42] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet [11:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:45] T243037: Shutdown scholarships.wikimedia.org and archive project - https://phabricator.wikimedia.org/T243037 [11:49:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10Joe) [11:50:16] (03CR) 10Vgutierrez: [C: 03+1] restbase-beta: Change wikifeeds 
URI for deployment prep (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861821 (https://phabricator.wikimedia.org/T306068) (owner: 10Jgiannelos) [11:50:26] (03CR) 10Jgiannelos: "Should be working now:" [puppet] - 10https://gerrit.wikimedia.org/r/861821 (https://phabricator.wikimedia.org/T306068) (owner: 10Jgiannelos) [11:51:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:51:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (10Joe) @RLazarus you can proceed with the decommissioning steps whenever you're ready. The servers are still in rotation as of now, and will need to be depooled first. I m... [11:53:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetdb2003.codfw.wmnet [11:53:35] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [11:54:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P41695 and previous config saved to /var/cache/conftool/dbconfig/20221129-115401-marostegui.json [11:54:08] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet [11:55:23] (03CR) 10Muehlenhoff: [C: 03+2] archiva/piwik: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/854574 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:56:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:58:19] (03PS3) 10Jbond: utils/puppet-debugger: add small shell script to run puppet-debugger [puppet] - 10https://gerrit.wikimedia.org/r/860874 [11:58:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P41696 and previous config saved to /var/cache/conftool/dbconfig/20221129-115847-ladsgroup.json [11:59:14] (03CR) 10Jbond: [C: 03+2] utils/puppet-debugger: add small shell script to run puppet-debugger [puppet] - 10https://gerrit.wikimedia.org/r/860874 (owner: 10Jbond) [12:00:09] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 585978040 and 2484 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:01:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P41697 and previous config saved to /var/cache/conftool/dbconfig/20221129-120100-ladsgroup.json [12:01:40] (03CR) 10Vgutierrez: [C: 03+2] restbase-beta: Change wikifeeds URI for deployment prep [puppet] - 10https://gerrit.wikimedia.org/r/861821 (https://phabricator.wikimedia.org/T306068) (owner: 10Jgiannelos) [12:01:56] (03CR) 10Filippo Giunchedi: [C: 03+1] Remove one more obsolete package after bullseye->bookworm upgrade [puppet] - 10https://gerrit.wikimedia.org/r/861829 (owner: 10Muehlenhoff) [12:02:07] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 2602 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:03:39] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:04:51] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [12:05:06] !log 
hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:05:09] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:05:51] (03CR) 10Muehlenhoff: [C: 03+2] Remove one more obsolete package after bullseye->bookworm upgrade [puppet] - 10https://gerrit.wikimedia.org/r/861829 (owner: 10Muehlenhoff) [12:06:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T323907)', diff saved to https://phabricator.wikimedia.org/P41698 and previous config saved to /var/cache/conftool/dbconfig/20221129-120601-ladsgroup.json [12:06:03] (03PS2) 10Muehlenhoff: Remove one more obsolete package after bullseye->bookworm upgrade [puppet] - 10https://gerrit.wikimedia.org/r/861829 [12:06:08] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [12:09:07] PROBLEM - DPKG on grafana1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [12:09:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P41699 and previous config saved to /var/cache/conftool/dbconfig/20221129-120907-marostegui.json [12:11:45] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:13:35] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.240 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:13:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T322618)', diff saved to https://phabricator.wikimedia.org/P41700 and previous config saved to /var/cache/conftool/dbconfig/20221129-121354-ladsgroup.json [12:14:02] T322618: Fix renamed indexes 
of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [12:16:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P41701 and previous config saved to /var/cache/conftool/dbconfig/20221129-121606-ladsgroup.json [12:16:47] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P41702 and previous config saved to /var/cache/conftool/dbconfig/20221129-122108-ladsgroup.json [12:24:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T321126)', diff saved to https://phabricator.wikimedia.org/P41703 and previous config saved to /var/cache/conftool/dbconfig/20221129-122414-marostegui.json [12:24:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1132.eqiad.wmnet with reason: Maintenance [12:24:20] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/861834 [12:24:22] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [12:24:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1132.eqiad.wmnet with reason: Maintenance [12:24:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T321126)', diff saved to https://phabricator.wikimedia.org/P41704 and previous config saved to /var/cache/conftool/dbconfig/20221129-122436-marostegui.json [12:25:55] (03PS18) 10Btullis: Add a spark-operator chart and helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [12:26:58] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T321126)', diff saved to https://phabricator.wikimedia.org/P41705 and previous config saved to /var/cache/conftool/dbconfig/20221129-122657-marostegui.json [12:28:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:30:14] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/861834 (owner: 10Muehlenhoff) [12:31:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T323907)', diff saved to https://phabricator.wikimedia.org/P41706 and previous config saved to /var/cache/conftool/dbconfig/20221129-123113-ladsgroup.json [12:31:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2122.codfw.wmnet with reason: Maintenance [12:31:20] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [12:31:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2122.codfw.wmnet with reason: Maintenance [12:31:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T323907)', diff saved to https://phabricator.wikimedia.org/P41707 and previous config saved to /var/cache/conftool/dbconfig/20221129-123134-ladsgroup.json [12:35:08] 10SRE, 10ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T323222 (10MoritzMuehlenhoff) >>! In T323222#8424915, @Papaul wrote: > @MoritzMuehlenhoff unfortunately this server is out of warranty. 
I know :-) See https://phabricator.wikimedia.org/T323222#8401501 [12:36:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P41708 and previous config saved to /var/cache/conftool/dbconfig/20221129-123614-ladsgroup.json [12:37:35] RECOVERY - SSH on mw1312.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:38:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:42:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P41709 and previous config saved to /var/cache/conftool/dbconfig/20221129-124203-marostegui.json [12:51:09] (03PS3) 10Jaime Nuche: create group for Release Engineering members [puppet] - 10https://gerrit.wikimedia.org/r/860836 [12:51:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T323907)', diff saved to https://phabricator.wikimedia.org/P41710 and previous config saved to /var/cache/conftool/dbconfig/20221129-125121-ladsgroup.json [12:51:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [12:51:29] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [12:51:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [12:51:56] (03Abandoned) 10Sergio Gimeno: refreshUserImpactData.php: Add minimum edit filter [extensions/GrowthExperiments] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/861818 (https://phabricator.wikimedia.org/T323958) 
(owner: 10Sergio Gimeno) [12:52:48] (03Abandoned) 10Sergio Gimeno: refreshUserImpactData.php: Add force and dry-run flags [extensions/GrowthExperiments] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/861817 (https://phabricator.wikimedia.org/T322541) (owner: 10Sergio Gimeno) [12:53:06] (03Abandoned) 10Sergio Gimeno: ImageRecommendation: End experiment [extensions/GrowthExperiments] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/861815 (https://phabricator.wikimedia.org/T323686) (owner: 10Sergio Gimeno) [12:53:40] (03Abandoned) 10Sergio Gimeno: NewImpact: Prepare experiment [extensions/GrowthExperiments] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/861816 (https://phabricator.wikimedia.org/T323526) (owner: 10Sergio Gimeno) [12:54:22] (03CR) 10MSantos: [C: 03+2] wikifeeds: bump to 2022-11-28-160349-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/861461 (owner: 10MSantos) [12:55:13] (03PS1) 10Sergio Gimeno: refreshUserImpactData.php: Add minimum edit filter [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861838 (https://phabricator.wikimedia.org/T323958) [12:56:37] (03PS1) 10Sergio Gimeno: refreshUserImpactData.php: Add force and dry-run flags [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861468 (https://phabricator.wikimedia.org/T322541) [12:56:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:57:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P41711 and previous config saved to /var/cache/conftool/dbconfig/20221129-125710-marostegui.json [12:57:52] 10SRE, 10Wikimedia-Portals, 10Wikimedia-Site-requests, 10Security, 10Vuln-XSS: 
Malicious meta admin can add javascript to https://office.wikimedia.org/api/ . Move api listing off wiki - https://phabricator.wikimedia.org/T109147 (10Krinkle) Relevant patches for future reference: >>! In T273179#8401089, @... [12:58:30] (03PS1) 10Sergio Gimeno: NewImpact: Prepare experiment [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861469 (https://phabricator.wikimedia.org/T323526) [12:59:08] (03Merged) 10jenkins-bot: wikifeeds: bump to 2022-11-28-160349-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/861461 (owner: 10MSantos) [13:00:27] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 200 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:00:37] !log installing glibc security updates on buster [13:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:54] (03PS4) 10Jaime Nuche: create group for Release Engineering members [puppet] - 10https://gerrit.wikimedia.org/r/860836 [13:02:29] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 5 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:02:56] (03CR) 10Jaime Nuche: create group for Release Engineering members (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/860836 (owner: 10Jaime Nuche) [13:03:57] (03PS5) 10Muehlenhoff: Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 [13:05:45] (03CR) 10CI reject: [V: 04-1] Add a new cookbook to roll-restart/reboot Swift proxies (also Thanos frontends) [cookbooks] - 10https://gerrit.wikimedia.org/r/856996 (owner: 
10Muehlenhoff) [13:10:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T323907)', diff saved to https://phabricator.wikimedia.org/P41712 and previous config saved to /var/cache/conftool/dbconfig/20221129-131006-ladsgroup.json [13:10:14] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [13:11:41] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:11:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:12:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T321126)', diff saved to https://phabricator.wikimedia.org/P41713 and previous config saved to /var/cache/conftool/dbconfig/20221129-131216-marostegui.json [13:12:18] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1134.eqiad.wmnet with reason: Maintenance [13:12:24] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [13:12:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1134.eqiad.wmnet with reason: Maintenance [13:12:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T321126)', diff saved to https://phabricator.wikimedia.org/P41714 and previous config saved to /var/cache/conftool/dbconfig/20221129-131238-marostegui.json [13:13:13] (03PS1) 10Slyngshede: Add utils file, for handy functions when working with LDAP. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/861840 [13:14:21] (03CR) 10Elukey: Rewrite as kubernetes operator/controller (031 comment) [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/861352 (https://phabricator.wikimedia.org/T323706) (owner: 10JMeybohm) [13:14:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T321126)', diff saved to https://phabricator.wikimedia.org/P41715 and previous config saved to /var/cache/conftool/dbconfig/20221129-131459-marostegui.json [13:18:47] 10SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10Jenkins: New Keyholder identity for RelEng deployments - https://phabricator.wikimedia.org/T324014 (10jnuche) [13:19:03] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/861823 (owner: 10Muehlenhoff) [13:19:20] 10SRE, 10Wikimedia-Portals, 10Wikimedia-Site-requests, 10Security, 10Vuln-XSS: Malicious meta admin can add javascript to https://office.wikimedia.org/api/ . 
Move api listing off wiki - https://phabricator.wikimedia.org/T109147 (10TheDJ) {meme, src="raptor-free"} [13:21:09] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/861385 (owner: 10Slyngshede) [13:22:10] (03CR) 10David Caro: [C: 03+1] wmcs: openstack: lib: ensure_canary: fix changelist calculation (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/861438 (owner: 10Arturo Borrero Gonzalez) [13:23:27] 10SRE, 10ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T323222 (10Papaul) @MoritzMuehlenhoff thanks for the update [13:23:47] (03CR) 10Kghbln: planet: Add ProWiki feed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861452 (owner: 10Kghbln) [13:24:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (10Papaul) ACK [13:24:24] (03PS14) 10Slyngshede: C:ldap::client::utils Rewrite add-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/860568 [13:24:32] (03CR) 10Slyngshede: C:ldap::client::utils Rewrite add-ldap-group (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede) [13:25:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P41717 and previous config saved to /var/cache/conftool/dbconfig/20221129-132513-ladsgroup.json [13:28:22] 10SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10Jenkins: New Keyholder identity for RelEng deployments - https://phabricator.wikimedia.org/T324014 (10taavi) -1. Scoping resources by WMF team excludes volunteers from participating and often doesn't reflect reality anyway. > It's a... 
[13:28:36] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [13:28:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] wmcs: openstack: lib: ensure_canary: fix changelist calculation (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/861438 (owner: 10Arturo Borrero Gonzalez) [13:30:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:30:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P41718 and previous config saved to /var/cache/conftool/dbconfig/20221129-133005-marostegui.json [13:30:34] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for db120[4-5] - pt1979@cumin2002" [13:31:46] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for Envoy on releases* [puppet] - 10https://gerrit.wikimedia.org/r/861846 (https://phabricator.wikimedia.org/T135991) [13:32:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for db120[4-5] - pt1979@cumin2002" [13:32:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:33:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1204.mgmt.eqiad.wmnet with reboot policy FORCED [13:34:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1205.mgmt.eqiad.wmnet with reboot policy FORCED [13:34:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:35:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:39:56] 10SRE, 10Wikimedia-Site-requests: Edits are made from internal cluster IPs: en.wiktionary, pl.wikipedia, and other sites - https://phabricator.wikimedia.org/T324018 (10Urbanecm) [13:40:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P41719 and previous config saved to /var/cache/conftool/dbconfig/20221129-134019-ladsgroup.json [13:41:00] 10SRE, 10Wikimedia-Site-requests: Edits are made from internal cluster IPs: en.wiktionary, pl.wikipedia, and other sites - https://phabricator.wikimedia.org/T324018 (10Urbanecm) Note I've blocked the IPv6 addresses to stop the vandalism. This has user impact as of now. [13:41:56] 10SRE, 10Wikimedia-Site-requests: Edits are made from internal cluster IPs: en.wiktionary, pl.wikipedia, and other sites - https://phabricator.wikimedia.org/T324018 (10Urbanecm) p:05Triage→03Unbreak! Has user impact => UBN. [13:43:24] 10SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10Jenkins: New Keyholder identity for RelEng deployments - https://phabricator.wikimedia.org/T324014 (10jnuche) @taavi, that is a valid concern. There is already a `jenkins-deploy` Unix user though, so to reduce confusion maybe we can... 
[13:44:14] (03CR) 10Muehlenhoff: [C: 03+2] puppetdb: Bump postgres version for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/861823 (owner: 10Muehlenhoff) [13:44:16] 10SRE, 10Wikimedia-Site-requests: Edits are made from internal cluster IPs: en.wiktionary, pl.wikipedia, and other sites - https://phabricator.wikimedia.org/T324018 (10Urbanecm) p:05Unbreak!→03Triage Relevant servers were depooled. [13:44:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:45:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P41720 and previous config saved to /var/cache/conftool/dbconfig/20221129-134511-marostegui.json [13:45:33] 10SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10Jenkins: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (10jnuche) [13:45:48] (03CR) 10Herron: [C: 03+1] wmnet: move read traffic to graphite1005 [dns] - 10https://gerrit.wikimedia.org/r/861356 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [13:46:34] (03CR) 10Herron: [C: 03+1] graphite: move alerts to graphite1005 [puppet] - 10https://gerrit.wikimedia.org/r/861358 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [13:47:04] (03CR) 10Herron: [C: 03+1] stats: failover writes to graphite1005 [puppet] - 10https://gerrit.wikimedia.org/r/861359 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [13:48:31] (03CR) 10Herron: [C: 03+1] ProductionServices: move to graphite1005 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861361 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [13:49:00] (03CR) 10Herron: [C: 
03+1] wmnet: move writes to graphite1005 [dns] - 10https://gerrit.wikimedia.org/r/861357 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [13:50:37] (03PS1) 10Majavah: reverse-proxy: Add eqiad e/f[1-4] subnets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861848 (https://phabricator.wikimedia.org/T324018) [13:50:50] jouncebot: nowandnext [13:50:51] No deployments scheduled for the next 0 hour(s) and 9 minute(s) [13:50:51] In 0 hour(s) and 9 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221129T1400) [13:50:51] In 0 hour(s) and 9 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221129T1400) [13:50:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. I'll merge that later." [puppet] - 10https://gerrit.wikimedia.org/r/860836 (owner: 10Jaime Nuche) [13:52:40] 10SRE: Load IP ranges in reverse-proxies.php from Netbox/Puppet network module - https://phabricator.wikimedia.org/T324020 (10Urbanecm) [13:53:51] 10SRE: Load IP ranges in reverse-proxy.php from Netbox/Puppet network module - https://phabricator.wikimedia.org/T324020 (10Urbanecm) [13:55:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T323907)', diff saved to https://phabricator.wikimedia.org/P41721 and previous config saved to /var/cache/conftool/dbconfig/20221129-135526-ladsgroup.json [13:55:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2150.codfw.wmnet with reason: Maintenance [13:55:36] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [13:55:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2150.codfw.wmnet with reason: Maintenance [13:55:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T323907)', diff saved to https://phabricator.wikimedia.org/P41722 and previous config saved to 
/var/cache/conftool/dbconfig/20221129-135549-ladsgroup.json [13:56:28] (03CR) 10Urbanecm: reverse-proxy: Add eqiad e/f[1-4] subnets (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861848 (https://phabricator.wikimedia.org/T324018) (owner: 10Majavah) [13:57:24] (03PS1) 10Filippo Giunchedi: varnish: teach confd-reload-vcl to write a Prometheus state file [puppet] - 10https://gerrit.wikimedia.org/r/861850 (https://phabricator.wikimedia.org/T314118) [13:57:27] (03PS2) 10Muehlenhoff: Enable profile::auto_restarts::service for prometheus-ipmi-exporter [puppet] - 10https://gerrit.wikimedia.org/r/860569 (https://phabricator.wikimedia.org/T135991) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221129T1400). nyaa~ [14:00:05] sergi0: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221129T1400) [14:00:15] hello [14:00:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T321126)', diff saved to https://phabricator.wikimedia.org/P41723 and previous config saved to /var/cache/conftool/dbconfig/20221129-140018-marostegui.json [14:00:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1135.eqiad.wmnet with reason: Maintenance [14:00:21] o/ [14:00:23] hi sergi0 [14:00:26] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [14:00:27] I can deploy today [14:00:29] but before i do... [14:00:38] * Lucas_WMDE blames TheresNoTime for jouncebot’s message [14:00:43] taavi: should we first do your fix for T324018? 
[14:00:43] T324018: Edits are made from internal cluster IPs: en.wiktionary, pl.wikipedia, and other sites - https://phabricator.wikimedia.org/T324018 [14:00:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1135.eqiad.wmnet with reason: Maintenance [14:00:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T321126)', diff saved to https://phabricator.wikimedia.org/P41724 and previous config saved to /var/cache/conftool/dbconfig/20221129-140050-marostegui.json [14:00:56] or can we go ahead with B&C, considering the appservers are no longer pooled? [14:01:15] urbanecm: I think you can go ahead. I'd like to get someone to double check my fix before deploying [14:01:24] yeah, i think so too [14:01:25] proceeding [14:01:41] (03CR) 10Urbanecm: [C: 03+2] NewImpact: Prepare experiment [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861469 (https://phabricator.wikimedia.org/T323526) (owner: 10Sergio Gimeno) [14:01:43] (03CR) 10Urbanecm: [C: 03+2] refreshUserImpactData.php: Add force and dry-run flags [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861468 (https://phabricator.wikimedia.org/T322541) (owner: 10Sergio Gimeno) [14:01:46] (03CR) 10Urbanecm: [C: 03+2] refreshUserImpactData.php: Add minimum edit filter [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861838 (https://phabricator.wikimedia.org/T323958) (owner: 10Sergio Gimeno) [14:03:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T321126)', diff saved to https://phabricator.wikimedia.org/P41725 and previous config saved to /var/cache/conftool/dbconfig/20221129-140311-marostegui.json [14:03:19] (03CR) 10Urbanecm: [C: 04-1] "I've backported few GrowthExperiments patches to wmf.12. This commit's no longer accurate. Should be regenerated." 
[core] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/860600 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) [14:04:29] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 6911431928 and 9944 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:05:39] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 626279880 and 83 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:05:51] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 11580236864 and 10025 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:06:22] (03PS1) 10Zabe: Start writing to cul_actor on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861853 (https://phabricator.wikimedia.org/T233004) [14:07:43] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 4680 and 10 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:07:54] urbanecm, may I add that ^ patch to the window? 
[14:07:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:08:03] zabe: sure [14:08:08] !log mbsantos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [14:08:35] !log mbsantos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [14:09:05] added to the calendar [14:09:53] !log mbsantos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [14:10:38] !log mbsantos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [14:11:42] !log mbsantos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [14:12:00] (03CR) 10Volans: [C: 03+1] "The network prefixes LGTM and match what's in Netbox. At the moment there isn't any row E/F public VLAN so those should cover all the case" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861848 (https://phabricator.wikimedia.org/T324018) (owner: 10Majavah) [14:12:31] !log mbsantos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [14:12:39] urbanecm: ^ can I deploy that now since the backports are in CI? [14:12:49] taavi: go ahead! 
[14:12:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861848 (https://phabricator.wikimedia.org/T324018) (owner: 10Majavah) [14:13:01] <_joe_> so, this could've happened if we had a cache server in row e/f too [14:14:03] yese [14:14:14] (03Merged) 10jenkins-bot: reverse-proxy: Add eqiad e/f[1-4] subnets [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861848 (https://phabricator.wikimedia.org/T324018) (owner: 10Majavah) [14:14:29] !log taavi@deploy1002 Started scap: Backport for [[gerrit:861848|reverse-proxy: Add eqiad e/f[1-4] subnets (T324018)]] [14:14:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:14:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:14:33] (03CR) 10Vgutierrez: [C: 04-1] varnish: teach confd-reload-vcl to write a Prometheus state file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/861850 (https://phabricator.wikimedia.org/T314118) (owner: 10Filippo Giunchedi) [14:14:36] T324018: Edits are made from internal cluster IPs: en.wiktionary, pl.wikipedia, and other sites - https://phabricator.wikimedia.org/T324018 [14:14:38] and will happen with 5-8 when they are taken into service before that's automated [14:15:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:15:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:16:02] !log taavi@deploy1002 taavi and taavi: Backport for [[gerrit:861848|reverse-proxy: Add eqiad e/f[1-4] subnets (T324018)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet 
[14:16:27] hmm. not sure how to test this one [14:16:38] since the mwdebug servers aren't affected [14:16:55] well at least it doesn't break what already works, so I'm syncing it [14:17:03] taavi: I'd sync, and crash-test it after. [14:17:31] I can bump the block to a hardblock, and then we can refresh the edit form a couple of times to verify you aren't affected by the block. [14:17:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:18:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P41726 and previous config saved to /var/cache/conftool/dbconfig/20221129-141818-marostegui.json [14:18:46] I don't have IPv6 at home.. but I guess that doesn't matter since I suspect this is like the envoy->apache hop or something like that [14:18:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:19:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:19:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:20:11] _joe_: now scap shows me lots of errors like this one when syncing https://phabricator.wikimedia.org/P41727 [14:20:19] 14:18:48 10 apaches had sync errors [14:20:39] (03Merged) 10jenkins-bot: NewImpact: Prepare experiment [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861469 (https://phabricator.wikimedia.org/T323526) (owner: 10Sergio Gimeno) [14:20:50] (03Merged) 10jenkins-bot: refreshUserImpactData.php: Add force and dry-run flags [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861468 (https://phabricator.wikimedia.org/T322541) (owner: 
10Sergio Gimeno) [14:20:50] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:20:51] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1058752 and 797 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:20:51] <_joe_> taavi: unrelated to my change heh [14:21:02] <_joe_> can you try to sync again? [14:21:16] let's see [14:21:32] I'll try once the current round of php-fpm restarts finish [14:22:03] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:861848|reverse-proxy: Add eqiad e/f[1-4] subnets (T324018)]] (duration: 07m 33s) [14:22:11] T324018: Edits are made from internal cluster IPs: en.wiktionary, pl.wikipedia, and other sites - https://phabricator.wikimedia.org/T324018 [14:22:29] !log taavi@deploy1002 Started scap: re-syncing the backport to see if the errors fix themself [14:22:34] (03PS1) 10Jbond: WIP: idea for cloud cumin::target [puppet] - 10https://gerrit.wikimedia.org/r/861855 [14:24:10] (03CR) 10CI reject: [V: 04-1] WIP: idea for cloud cumin::target [puppet] - 10https://gerrit.wikimedia.org/r/861855 (owner: 10Jbond) [14:24:19] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1985640 and 1007 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:24:28] failing this time too, looks like it affects scap proxies only [14:24:31] so this is probably a scap bug [14:25:06] um, no [14:25:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:26:12] sergi0: fyi, since wmf.12 is not at any wikis, there won't be anything for you to test. so, once it merges, it'll be all done for you. [14:26:20] it's something weird related to the new hosts, but I'm not quite sure why [14:26:27] (03CR) 10Muehlenhoff: Allow multiple server connections to be defined. 
(031 comment) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/860857 (owner: 10Slyngshede) [14:26:57] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:27] !log taavi@deploy1002 Finished scap: re-syncing the backport to see if the errors fix themself (duration: 04m 58s) [14:27:29] (03PS2) 10Jbond: WIP: idea for cloud cumin::target [puppet] - 10https://gerrit.wikimedia.org/r/861855 [14:28:13] urbanecm: yep, just here to react if there were questions about the changes. Thank you! [14:28:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:28:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:28:40] yep, nothing from me, just waiting for everything to merge :) [14:29:04] (03CR) 10CI reject: [V: 04-1] WIP: idea for cloud cumin::target [puppet] - 10https://gerrit.wikimedia.org/r/861855 (owner: 10Jbond) [14:29:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:29:44] (03CR) 10Jbond: [C: 03+1] Enable profile::auto_restarts::service for prometheus-ipmi-exporter [puppet] - 10https://gerrit.wikimedia.org/r/860569 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:29:50] taavi: it almost looks like scap goes through the same server more than once...? 
[14:30:48] urbanecm: not sure, filed T324023 [14:30:48] T324023: scap fails to sync some new hosts - https://phabricator.wikimedia.org/T324023 [14:30:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T323907)', diff saved to https://phabricator.wikimedia.org/P41728 and previous config saved to /var/cache/conftool/dbconfig/20221129-143049-ladsgroup.json [14:31:18] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [14:32:29] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 61 probes of 791 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:32:30] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (10Jclark-ctr) @Papaul Verified they are connected [14:32:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:32:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:32:49] _joe_: you can re-pool the hosts now, btw [14:33:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P41729 and previous config saved to /var/cache/conftool/dbconfig/20221129-143324-marostegui.json [14:33:39] <_joe_> taavi: you're still having issues running scap? 
[14:34:04] I am, weirdly enough with the mw148X and mw149X hosts (these new hosts) only [14:34:20] <_joe_> that's not what was in your previous report [14:34:28] <_joe_> the one in the task I mean [14:34:29] but I want to try the sync after re-pooling, if there's something weird with them [14:34:32] <_joe_> it's hosts all over [14:34:42] <_joe_> taavi: no I think there is one possible explanation [14:34:43] wdym? "This seems to only affect some of the servers added in T313327: Put mw14[57-98] in production." [14:34:44] T313327: Put mw14[57-98] in production - https://phabricator.wikimedia.org/T313327 [14:34:54] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.5 point update - https://phabricator.wikimedia.org/T317416 (10MoritzMuehlenhoff) [14:34:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:34:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:35:13] the list of servers in the command line is scap passing the list of proxies when pulling, not the servers which are failing [14:35:15] <_joe_> taavi: let me test one thing, heh [14:35:21] yes? 
[14:35:40] <_joe_> uff this could be a serious issue [14:37:09] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:22] (03Merged) 10jenkins-bot: refreshUserImpactData.php: Add minimum edit filter [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861838 (https://phabricator.wikimedia.org/T323958) (owner: 10Sergio Gimeno) [14:37:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:37:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:37:43] urbanecm: ^ please wait a bit before continuing with the backports [14:37:44] (03CR) 10Daniel Kinzler: "Patch to fix the test failure: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/861859" [core] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/860600 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) [14:37:48] yes, waiting [14:37:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:38:07] the wmf. 
backports are essentially done though, wmf.12's not at deployment host [14:38:16] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: byte/str mismatch TypeError when converting any STL file - https://phabricator.wikimedia.org/T323781 (10hnowlan) 05Open→03Resolved [14:38:20] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan) [14:39:07] it is there, just without any of the extensions as the branch commit failed to merge [14:39:29] my backport pulled it there, I guess [14:39:31] well, yes, so, GE code is not at deployment host :) [14:39:50] (I CR-1'ed the branch commit instead to highlight it) [14:40:13] <_joe_> I hope the change I'm doing is enough [14:40:28] what are you doing? [14:40:54] <_joe_> adding scap proxies in those rows [14:41:02] <_joe_> because it's not a firewall issue [14:41:11] <_joe_> I can scap pull from those nodes with no issues [14:41:38] (03PS1) 10Giuseppe Lavagetto: scap: add proxies in row E and F [puppet] - 10https://gerrit.wikimedia.org/r/861861 (https://phabricator.wikimedia.org/T324023) [14:42:14] <_joe_> taavi: I think the problem is in the heuristics scap uses to find the nearest proxy [14:42:29] hmm, will it work like that or will it need a proxy in each rack since they all have their own subnets now? 
[14:42:32] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap: add proxies in row E and F [puppet] - 10https://gerrit.wikimedia.org/r/861861 (https://phabricator.wikimedia.org/T324023) (owner: 10Giuseppe Lavagetto) [14:42:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:43:07] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1204.mgmt.eqiad.wmnet with reboot policy FORCED [14:43:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1205.mgmt.eqiad.wmnet with reboot policy FORCED [14:44:13] <_joe_> taavi: that is what I fear, yes [14:44:20] oh, I think I know why it's crashing. it loops through ipv4 and ipv6 and tries to delete a host-specific key for both address families [14:45:30] <_joe_> taavi: can you try to re-sync again? 
[14:45:45] sure [14:45:47] !log taavi@deploy1002 Started scap: testing a scap sync [14:45:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P41730 and previous config saved to /var/cache/conftool/dbconfig/20221129-144556-ladsgroup.json [14:46:21] <_joe_> taavi: let's hope it doesn't fail on mw1494-5 [14:46:31] <_joe_> else I'll need to add more scap proxies for now [14:46:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT deployments) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:47:11] (03PS1) 10Daniel Kinzler: Fix LanguageVariantConverter test [core] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861471 (https://phabricator.wikimedia.org/T323985) [14:47:26] <_joe_> taavi: uhm still not finished to sync? 
[14:48:14] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:48:23] _joe_: failed on 8 hosts, give me a second and I'll paste their hostnames for you [14:48:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T321126)', diff saved to https://phabricator.wikimedia.org/P41731 and previous config saved to /var/cache/conftool/dbconfig/20221129-144831-marostegui.json [14:48:33] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1139.eqiad.wmnet with reason: Maintenance [14:48:39] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [14:48:44] <_joe_> so failed on every host but the two I added, wtf it's even worse than I thought [14:48:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1139.eqiad.wmnet with reason: Maintenance [14:48:48] <_joe_> topranks: ^^ [14:49:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1140.eqiad.wmnet with reason: Maintenance [14:49:05] <_joe_> this is a serious issue with row E/F and scap it seems [14:49:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:49:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance [14:49:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1140.eqiad.wmnet with reason: Maintenance [14:49:28] * topranks looking [14:49:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1169.eqiad.wmnet with reason: Maintenance [14:49:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mw-debug: apply [14:49:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1169.eqiad.wmnet with reason: Maintenance [14:49:52] _joe_: so I still suspect it's a scap bug [14:49:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T321126)', diff saved to https://phabricator.wikimedia.org/P41732 and previous config saved to /var/cache/conftool/dbconfig/20221129-144952-marostegui.json [14:50:37] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 20 probes of 791 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:51:04] !log taavi@deploy1002 Finished scap: testing a scap sync (duration: 05m 17s) [14:51:08] zabe: it seems we won't have time for your config patch, due to scap issues. can you reschedule it please? [14:51:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:51:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:52:14] sure sure, already guessed that [14:53:56] Heads up - I'm about to shut down 7 Hadoop worker nodes for hardware maintenance. I'm downtiming them in Icinga, but we may get some aggregate alarms fired from Alertmanager. 
[14:54:19] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 6 hosts with reason: replacing RAID controller battery [14:54:37] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 6 hosts with reason: replacing RAID controller battery [14:54:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c74eeb70-b29b-4aff-94c9-af5dbbe99cbd) set by btullis@cumin1001 for 6:00:00 on 6 h... [14:55:11] (03CR) 10Jelto: sre.gitlab.upgrade: add cookbook to upgrade GitLab version (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [14:55:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T321126)', diff saved to https://phabricator.wikimedia.org/P41734 and previous config saved to /var/cache/conftool/dbconfig/20221129-145513-marostegui.json [14:55:20] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [14:55:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:56:02] (03CR) 10Daniel Kinzler: Branch commit for wmf/1.40.0-wmf.12 (031 comment) [core] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/860600 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) [14:58:24] !log oblivian@deploy1002 Synchronized wmf-config/reverse-proxy.php: test deployment (duration: 04m 13s) [15:00:04] !log oblivian@cumin1001 conftool action : set/pooled=inactive; selector: dc=eqiad,name=mw14(89|9).* [15:00:26] !log removing /srv/cassandra on all maps hosts [15:00:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:01:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P41735 and previous config saved to /var/cache/conftool/dbconfig/20221129-150103-ladsgroup.json [15:01:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:01:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:02:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:03:03] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1204'] [15:03:42] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1205'] [15:05:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (10Papaul) [15:05:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:05:35] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:05:37] PROBLEM - Host an-worker1089 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:49] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on an-worker1089.eqiad.wmnet with reason: replacing RAID controller battery [15:07:04] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
an-worker1089.eqiad.wmnet with reason: replacing RAID controller battery [15:07:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=e8e1fd16-0d7d-47b2-8304-a9cb280e0cc5) set by btullis@cumin1001 for 6:00:00 on 1 h... [15:08:21] PROBLEM - mediawiki-installation DSH group on mw1491 is CRITICAL: Host mw1491 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:09:03] ACKNOWLEDGEMENT - Host an-worker1079 is DOWN: PING CRITICAL - Packet loss = 100% Btullis Intentional downtime for RAID battery replacement [15:09:03] ACKNOWLEDGEMENT - Host an-worker1083 is DOWN: PING CRITICAL - Packet loss = 100% Btullis Intentional downtime for RAID battery replacement [15:09:03] ACKNOWLEDGEMENT - Host an-worker1085 is DOWN: PING CRITICAL - Packet loss = 100% Btullis Intentional downtime for RAID battery replacement [15:09:04] ACKNOWLEDGEMENT - Host an-worker1089 is DOWN: PING CRITICAL - Packet loss = 100% Btullis Intentional downtime for RAID battery replacement [15:09:05] ACKNOWLEDGEMENT - Host an-worker1090 is DOWN: PING CRITICAL - Packet loss = 100% Btullis Intentional downtime for RAID battery replacement [15:09:06] ACKNOWLEDGEMENT - Host an-worker1093 is DOWN: PING CRITICAL - Packet loss = 100% Btullis Intentional downtime for RAID battery replacement [15:09:07] ACKNOWLEDGEMENT - Host an-worker1094 is DOWN: PING CRITICAL - Packet loss = 100% Btullis Intentional downtime for RAID battery replacement [15:10:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P41737 and previous config saved to /var/cache/conftool/dbconfig/20221129-151020-marostegui.json [15:10:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:13:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10BTullis) Thanks @Jclark-ctr - All hosts are shut down and ready for replacement. {F35824185,width=70%} Feel free to replace the batteries and res... [15:15:22] (03CR) 10Volans: [C: 04-1] "LGTM but has a typo" [puppet] - 10https://gerrit.wikimedia.org/r/860902 (https://phabricator.wikimedia.org/T323569) (owner: 10Jelto) [15:15:28] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:16:01] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T323907)', diff saved to https://phabricator.wikimedia.org/P41739 and previous config saved to /var/cache/conftool/dbconfig/20221129-151609-ladsgroup.json [15:16:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2159.codfw.wmnet with reason: Maintenance [15:16:18] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [15:16:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2159.codfw.wmnet with reason: Maintenance [15:16:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [15:16:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [15:16:48] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Depooling db2159 (T323907)', diff saved to https://phabricator.wikimedia.org/P41740 and previous config saved to /var/cache/conftool/dbconfig/20221129-151647-ladsgroup.json [15:17:11] (03CR) 10FNegri: WIP: idea for cloud cumin::target (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861855 (owner: 10Jbond) [15:18:01] PROBLEM - mediawiki-installation DSH group on mw1495 is CRITICAL: Host mw1495 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:19:59] PROBLEM - mediawiki-installation DSH group on mw1490 is CRITICAL: Host mw1490 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:20:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1205'] [15:25:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P41741 and previous config saved to /var/cache/conftool/dbconfig/20221129-152526-marostegui.json [15:25:49] !log set thanos ring replicas to 3.0 T311690 [15:25:51] PROBLEM - mediawiki-installation DSH group on mw1489 is CRITICAL: Host mw1489 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:25:51] PROBLEM - mediawiki-installation DSH group on mw1492 is CRITICAL: Host mw1492 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:11] PROBLEM - Host an-worker1090.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:26:19] (03CR) 10Filippo Giunchedi: varnish: teach confd-reload-vcl to write a Prometheus state file (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/861850 (https://phabricator.wikimedia.org/T314118) (owner: 10Filippo Giunchedi) [15:26:24] T311690: 
Shorten Thanos retention - https://phabricator.wikimedia.org/T311690 [15:26:24] (03PS2) 10Filippo Giunchedi: varnish: teach confd-reload-vcl to write a Prometheus state file [puppet] - 10https://gerrit.wikimedia.org/r/861850 (https://phabricator.wikimedia.org/T314118) [15:28:00] jouncebot: next [15:28:00] In 1 hour(s) and 31 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221129T1700) [15:28:09] (03CR) 10Filippo Giunchedi: [C: 03+2] wmnet: move read traffic to graphite1005 [dns] - 10https://gerrit.wikimedia.org/r/861356 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [15:28:13] (03PS2) 10Filippo Giunchedi: wmnet: move read traffic to graphite1005 [dns] - 10https://gerrit.wikimedia.org/r/861356 (https://phabricator.wikimedia.org/T318903) [15:29:18] (03CR) 10Btullis: Add a spark-operator chart and helmfile configuration (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [15:30:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 10%: Maint done', diff saved to https://phabricator.wikimedia.org/P41742 and previous config saved to /var/cache/conftool/dbconfig/20221129-153049-ladsgroup.json [15:31:15] PROBLEM - mediawiki-installation DSH group on mw1493 is CRITICAL: Host mw1493 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:31:15] PROBLEM - mediawiki-installation DSH group on mw1496 is CRITICAL: Host mw1496 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:31:15] PROBLEM - mediawiki-installation DSH group on mw1494 is CRITICAL: Host mw1494 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:31:15] PROBLEM - mediawiki-installation DSH group on mw1498 is CRITICAL: Host mw1498 is not in mediawiki-installation 
dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:31:15] PROBLEM - mediawiki-installation DSH group on mw1497 is CRITICAL: Host mw1497 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [15:32:38] 10SRE, 10Wikimedia-Site-requests: Edits are made from internal cluster IPs: en.wiktionary, pl.wikipedia, and other sites - https://phabricator.wikimedia.org/T324018 (10TheresNoTime) Snrk, [[ https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-mediawiki-1-7.0.0-1-2022.11.29?id=pMF2w4QBW_7Siu4Bw... [15:33:13] 10SRE, 10Wikimedia-Site-requests: Edits are made from internal cluster IPs: en.wiktionary, pl.wikipedia, and other sites - https://phabricator.wikimedia.org/T324018 (10taavi) 05Open→03Resolved [15:34:29] (03PS4) 10JMeybohm: Rewrite as kubernetes operator/controller [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/861352 (https://phabricator.wikimedia.org/T323706) [15:34:31] (03PS4) 10JMeybohm: update vendor [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/861353 (https://phabricator.wikimedia.org/T323706) [15:37:01] (03CR) 10JMeybohm: Rewrite as kubernetes operator/controller (031 comment) [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/861352 (https://phabricator.wikimedia.org/T323706) (owner: 10JMeybohm) [15:37:05] (03PS2) 10Jcrespo: mediabackups: Add new policy intended for admin deletion of files [puppet] - 10https://gerrit.wikimedia.org/r/860838 (https://phabricator.wikimedia.org/T323796) [15:38:13] (03CR) 10JMeybohm: [C: 03+2] felix: Instruct felix to set the src parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/859586 (owner: 10Alexandros Kosiaris) [15:39:08] (03PS5) 10Jbond: spicerack: add monitoring for sre.puppet.netbox-sync [puppet] - 10https://gerrit.wikimedia.org/r/860019 [15:40:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T321126)', diff saved to 
https://phabricator.wikimedia.org/P41743 and previous config saved to /var/cache/conftool/dbconfig/20221129-154033-marostegui.json [15:40:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1184.eqiad.wmnet with reason: Maintenance [15:40:41] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [15:40:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1184.eqiad.wmnet with reason: Maintenance [15:40:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T321126)', diff saved to https://phabricator.wikimedia.org/P41744 and previous config saved to /var/cache/conftool/dbconfig/20221129-154055-marostegui.json [15:40:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10Jclark-ctr) [15:40:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:41:50] (03CR) 10Volans: "It looks much better! Answers and minor nits inline, it's much closer to be ready." 
[cookbooks] - https://gerrit.wikimedia.org/r/858999 (https://phabricator.wikimedia.org/T323569) (owner: Jelto) [15:42:49] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1204'] [15:43:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T321126)', diff saved to https://phabricator.wikimedia.org/P41745 and previous config saved to /var/cache/conftool/dbconfig/20221129-154316-marostegui.json [15:43:17] (Merged) jenkins-bot: felix: Instruct felix to set the src parameter [deployment-charts] - https://gerrit.wikimedia.org/r/859586 (owner: Alexandros Kosiaris) [15:43:26] (PS3) Jcrespo: mediabackups: Add new policy intended for admin deletion of files [puppet] - https://gerrit.wikimedia.org/r/860838 (https://phabricator.wikimedia.org/T323796) [15:43:41] (CR) Jbond: "i have updated the logic so that there is some grace time between data missing and the error occurring, see inline for other comments" [puppet] - https://gerrit.wikimedia.org/r/860019 (owner: Jbond) [15:44:21] PROBLEM - Host an-worker1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:44:39] RECOVERY - Host an-worker1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [15:45:22] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1204'] [15:45:35] (CR) C. 
Scott Ananian: [C: +2] Fix LanguageVariantConverter test [core] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/861471 (https://phabricator.wikimedia.org/T323985) (owner: Daniel Kinzler) [15:45:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 25%: Maint done', diff saved to https://phabricator.wikimedia.org/P41746 and previous config saved to /var/cache/conftool/dbconfig/20221129-154554-ladsgroup.json [15:45:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:46:35] (PS1) C. Scott Ananian: Bump parsoid to 0.17.0-a7 [vendor] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/861472 (https://phabricator.wikimedia.org/T323479) [15:47:06] (CR) Subramanya Sastry: [C: +2] Bump parsoid to 0.17.0-a7 [vendor] (wmf/1.40.0-wmf.12) - https://gerrit.wikimedia.org/r/861472 (https://phabricator.wikimedia.org/T323479) (owner: C. 
Scott Ananian) [15:47:18] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1204'] [15:50:19] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Add new policy intended for admin deletion of files [puppet] - 10https://gerrit.wikimedia.org/r/860838 (https://phabricator.wikimedia.org/T323796) (owner: 10Jcrespo) [15:50:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10Jclark-ctr) [15:50:42] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: codfw VM for VRTS - https://phabricator.wikimedia.org/T324030 (10Arnoldokoth) [15:51:49] PROBLEM - SSH on mw1331.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:54:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T323907)', diff saved to https://phabricator.wikimedia.org/P41747 and previous config saved to /var/cache/conftool/dbconfig/20221129-155401-ladsgroup.json [15:54:09] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [15:56:31] PROBLEM - Host an-worker1085.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:56:39] RECOVERY - Host an-worker1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [15:57:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10Jclark-ctr) [15:57:32] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for dragonfly-supernode [puppet] - 10https://gerrit.wikimedia.org/r/861888 (https://phabricator.wikimedia.org/T135991) [15:58:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P41748 and previous config saved to 
/var/cache/conftool/dbconfig/20221129-155822-marostegui.json [15:58:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (10Papaul) [15:58:58] !log oblivian@cumin1001 conftool action : set/pooled=no; selector: dc=eqiad,name=mw14(89|9).* [15:59:13] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (10Marostegui) Is this something @Papaul can finish? [16:00:13] (03Merged) 10jenkins-bot: Fix LanguageVariantConverter test [core] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861471 (https://phabricator.wikimedia.org/T323985) (owner: 10Daniel Kinzler) [16:01:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 75%: Maint done', diff saved to https://phabricator.wikimedia.org/P41749 and previous config saved to /var/cache/conftool/dbconfig/20221129-160059-ladsgroup.json [16:01:14] (03PS1) 10Andrew Bogott: Openstack config: move oslo_messaging_rabbit into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861890 [16:02:34] PROBLEM - Host an-worker1083.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:02:37] RECOVERY - Host an-worker1085.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms [16:03:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [16:03:02] (03Merged) 10jenkins-bot: Bump parsoid to 0.17.0-a7 [vendor] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861472 (https://phabricator.wikimedia.org/T323479) (owner: 10C. 
Scott Ananian) [16:03:17] (03CR) 10Ssingh: [V: 03+1] P:cache::haproxy: harden systemd unit (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/861445 (https://phabricator.wikimedia.org/T323944) (owner: 10Ssingh) [16:03:43] (03PS2) 10Andrew Bogott: Openstack config: move oslo_messaging_rabbit into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861890 [16:04:11] !log oblivian@deploy1002 Synchronized wmf-config/reverse-proxy.php: test deployment (duration: 04m 36s) [16:05:16] 10SRE, 10Infrastructure-Foundations, 10netops: ICMPv6 'TTL Exceeded' messages are not generated by row E/F switches due to loopback filter - https://phabricator.wikimedia.org/T324033 (10cmooney) p:05Triage→03Medium [16:05:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10Jclark-ctr) [16:06:40] (03PS2) 10Ssingh: P:cache::haproxy: harden systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/861445 (https://phabricator.wikimedia.org/T323944) [16:06:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [16:06:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [16:06:50] (03PS1) 10Jcrespo: install_server: Add db1204, db1205 to the config to wipe disks on 1st install [puppet] - 10https://gerrit.wikimedia.org/r/861893 (https://phabricator.wikimedia.org/T313978) [16:06:53] (03PS3) 10Andrew Bogott: Openstack config: move oslo_messaging_rabbit into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861890 [16:07:37] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38485/console" [puppet] - 10https://gerrit.wikimedia.org/r/861445 (https://phabricator.wikimedia.org/T323944) (owner: 10Ssingh) [16:08:39] RECOVERY - Host an-worker1083.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [16:08:56] 10SRE, 
10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [16:08:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:08:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [16:09:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P41750 and previous config saved to /var/cache/conftool/dbconfig/20221129-160907-ladsgroup.json [16:09:13] (03CR) 10Ssingh: [V: 03+1] "> → Overall exposure level for haproxy.service: 3.4 OK 🙂" [puppet] - 10https://gerrit.wikimedia.org/r/861445 (https://phabricator.wikimedia.org/T323944) (owner: 10Ssingh) [16:09:14] !log oblivian@deploy1002 Synchronized wmf-config/reverse-proxy.php: test deployment (duration: 04m 35s) [16:09:21] RECOVERY - mediawiki-installation DSH group on mw1491 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:11:42] !log robh@cumin2002 START - Cookbook sre.dns.netbox [16:11:46] 10SRE, 10Infrastructure-Foundations, 10netops: ICMPv6 'TTL Exceeded' messages are not generated by row E/F switches due to loopback filter - https://phabricator.wikimedia.org/T324033 (10cmooney) [16:12:14] (03Abandoned) 10Muehlenhoff: Make ganeti2031 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/835553 (https://phabricator.wikimedia.org/T313857) (owner: 10Muehlenhoff) [16:12:29] (03CR) 10Hashar: "Jaime and I are pairing to rebase this change on top of the latest version of mediawiki/core@wmf/1.40.0-wmf.12 and updating the submodules" [core] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/860600 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) 
[16:12:33] (03PS2) 10Jcrespo: install_server: Add db1204, db1205 to the config to wipe disks on setup [puppet] - 10https://gerrit.wikimedia.org/r/861893 (https://phabricator.wikimedia.org/T313978) [16:12:36] !log oblivian@cumin1001 conftool action : set/pooled=yes; selector: dc=eqiad,name=mw14(89|9).* [16:13:07] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184', diff saved to https://phabricator.wikimedia.org/P41751 and previous config saved to /var/cache/conftool/dbconfig/20221129-161329-marostegui.json [16:13:35] !log robh@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: eqsin hosts - robh@cumin2002" [16:13:53] !log oblivian@deploy1002 Synchronized wmf-config/reverse-proxy.php: test deployment (duration: 04m 28s) [16:13:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10Jclark-ctr) [16:14:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [16:14:21] (03PS4) 10Andrew Bogott: Openstack config: move oslo_messaging_rabbit into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861890 [16:14:38] !log robh@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: eqsin hosts - robh@cumin2002" [16:14:38] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:14:43] PROBLEM - Host an-worker1079.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:14:55] (03CR) 10Filippo Giunchedi: [C: 04-1] "I think the logic won't work as expected, unless I'm missing something" [puppet] - 10https://gerrit.wikimedia.org/r/860019 
(owner: 10Jbond) [16:15:29] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3440779304 and 2213 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:15:31] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5021 [16:16:00] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5021 [16:16:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 100%: Maint done', diff saved to https://phabricator.wikimedia.org/P41752 and previous config saved to /var/cache/conftool/dbconfig/20221129-161604-ladsgroup.json [16:16:19] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5022 [16:16:44] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5022 [16:16:49] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5023 [16:16:57] (03PS1) 10Papaul: Add new db node to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/861894 (https://phabricator.wikimedia.org/T313978) [16:17:11] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5023 [16:17:15] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5024 [16:17:17] RECOVERY - Host an-worker1079.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms [16:17:31] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 8 and 2336 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [16:17:39] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5024 [16:17:44] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5025 [16:17:56] !log 
mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [16:17:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [16:18:07] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5025 [16:18:11] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5026 [16:18:36] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5026 [16:18:41] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5027 [16:18:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [16:19:03] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5027 [16:19:05] RECOVERY - mediawiki-installation DSH group on mw1495 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:19:10] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host lvs5004 [16:19:42] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host lvs5004 [16:19:52] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti5004 [16:20:29] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti5004 [16:20:39] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host dns5004 [16:20:49] PROBLEM - Host an-worker1093.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:21:01] RECOVERY - mediawiki-installation DSH group on mw1490 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:21:08] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host dns5004 [16:21:37] (03CR) 10Papaul: [C: 03+2] Add new db node to site.pp and 
netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/861894 (https://phabricator.wikimedia.org/T313978) (owner: 10Papaul) [16:23:02] (03Abandoned) 10Jcrespo: install_server: Add db1204, db1205 to the config to wipe disks on setup [puppet] - 10https://gerrit.wikimedia.org/r/861893 (https://phabricator.wikimedia.org/T313978) (owner: 10Jcrespo) [16:23:13] (03PS5) 10Andrew Bogott: Openstack config: move oslo_messaging_rabbit into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861890 [16:23:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (10Papaul) [16:23:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10Jclark-ctr) [16:23:36] !log sukhe@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for 16 hosts [16:23:42] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for 16 hosts [16:23:54] (03PS2) 10Jaime Nuche: Branch commit for wmf/1.40.0-wmf.12 [core] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/860600 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) [16:23:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:24:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P41753 and previous config saved to /var/cache/conftool/dbconfig/20221129-162414-ladsgroup.json [16:24:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: Q1:rack/setup/install db1204, db1205 - 
https://phabricator.wikimedia.org/T313978 (10Papaul) Waiting on John to connected those servers into 1G port since there are connected to 10G port so i can redo the switch configuration and... [16:24:43] (03PS1) 10Cathal Mooney: Remove VRF-specific loopback filter from row E/F switches [homer/public] - 10https://gerrit.wikimedia.org/r/861896 (https://phabricator.wikimedia.org/T324033) [16:25:53] (03CR) 10JMeybohm: "All these versions are so confusing 😞" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/861349 (https://phabricator.wikimedia.org/T323793) (owner: 10Elukey) [16:26:21] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 327 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:26:53] RECOVERY - mediawiki-installation DSH group on mw1489 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:26:53] RECOVERY - mediawiki-installation DSH group on mw1492 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:26:54] RECOVERY - Host an-worker1093.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [16:27:01] PROBLEM - Host an-worker1094.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:27:58] (03PS1) 10Ebernhardson: cirrus: Enable document size limiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861897 (https://phabricator.wikimedia.org/T323687) [16:28:13] (03CR) 10Jaime Nuche: [C: 03+1] "I think it should be ready to go now." 
[core] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/860600 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) [16:28:23] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:28:33] (03CR) 10Hashar: [C: 03+2] "Ahmon, that fix the wmf/1.40.0-wmf.12 branch" [core] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/860600 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) [16:28:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T321126)', diff saved to https://phabricator.wikimedia.org/P41754 and previous config saved to /var/cache/conftool/dbconfig/20221129-162835-marostegui.json [16:28:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1186.eqiad.wmnet with reason: Maintenance [16:28:41] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: ICMPv6 'TTL Exceeded' messages are not generated by row E/F switches due to loopback filter - https://phabricator.wikimedia.org/T324033 (10cmooney) For the record after comparing the loopback fitlters on lo0.0 (common-loopback.pol) and lo0.... 
[16:28:43] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [16:28:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1186.eqiad.wmnet with reason: Maintenance [16:28:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T321126)', diff saved to https://phabricator.wikimedia.org/P41755 and previous config saved to /var/cache/conftool/dbconfig/20221129-162857-marostegui.json [16:31:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T321126)', diff saved to https://phabricator.wikimedia.org/P41756 and previous config saved to /var/cache/conftool/dbconfig/20221129-163118-marostegui.json [16:31:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10Jclark-ctr) [16:31:53] PROBLEM - swift eqiad object availability low on alert1001 is CRITICAL: cluster=thanos instance=thanos-fe1001 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad [16:32:17] RECOVERY - mediawiki-installation DSH group on mw1493 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:32:17] RECOVERY - mediawiki-installation DSH group on mw1496 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:32:17] RECOVERY - mediawiki-installation DSH group on mw1498 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:32:17] RECOVERY - mediawiki-installation DSH group on mw1494 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:32:17] RECOVERY - mediawiki-installation DSH group on mw1497 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [16:33:32] (03PS6) 10Andrew Bogott: Openstack config: 
move oslo_messaging_rabbit into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861890 [16:34:14] (03CR) 10Hnowlan: [C: 04-1] "Mostly lgtm, one fix needed." [deployment-charts] - 10https://gerrit.wikimedia.org/r/861401 (https://phabricator.wikimedia.org/T323916) (owner: 10Klausman) [16:35:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Shared-Data-Infrastructure: Multiple RAID battery failures on hadoop worker hosts - https://phabricator.wikimedia.org/T318659 (10Jclark-ctr) Finished replacing raid battery on an-worker1079 an-worker1083 an-worker1085 an-worker1089 an-worker1090 an-worker1093 an-work... [16:36:58] (03PS2) 10Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) [16:37:29] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5021'] [16:37:47] (03PS7) 10Andrew Bogott: Openstack config: move oslo_messaging_rabbit into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861890 [16:37:49] (03CR) 10CI reject: [V: 04-1] WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) (owner: 10Elukey) [16:37:52] !log robh@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['cp5021'] [16:38:23] (03CR) 10CI reject: [V: 04-1] Openstack config: move oslo_messaging_rabbit into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861890 (owner: 10Andrew Bogott) [16:38:37] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5021.mgmt.eqsin.wmnet with reboot policy FORCED [16:38:52] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5022.mgmt.eqsin.wmnet with reboot policy FORCED [16:39:19] RECOVERY - Host an-worker1094.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.64 ms [16:39:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db2159 (T323907)', diff saved to https://phabricator.wikimedia.org/P41757 and previous config saved to /var/cache/conftool/dbconfig/20221129-163921-ladsgroup.json [16:39:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [16:39:28] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [16:39:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance [16:39:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T323907)', diff saved to https://phabricator.wikimedia.org/P41758 and previous config saved to /var/cache/conftool/dbconfig/20221129-163942-ladsgroup.json [16:40:20] (03PS8) 10Andrew Bogott: Openstack config: move oslo_messaging_rabbit into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861890 [16:41:05] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5023.mgmt.eqsin.wmnet with reboot policy FORCED [16:42:04] (03CR) 10JMeybohm: [C: 03+1] Enable profile::auto_restarts::service for dragonfly-supernode [puppet] - 10https://gerrit.wikimedia.org/r/861888 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [16:42:46] (03CR) 10Effie Mouzeli: [C: 03+1] Remove nutcracker from cloudweb hosts [puppet] - 10https://gerrit.wikimedia.org/r/861807 (https://phabricator.wikimedia.org/T277183) (owner: 10Majavah) [16:43:26] (03Merged) 10jenkins-bot: Branch commit for wmf/1.40.0-wmf.12 [core] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/860600 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) [16:46:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P41759 and previous config saved to /var/cache/conftool/dbconfig/20221129-164624-marostegui.json [16:49:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] 
START helmfile.d/services/mw-debug: apply [16:49:51] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5022.mgmt.eqsin.wmnet with reboot policy FORCED [16:49:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [16:49:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [16:50:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (10Jclark-ctr) I have connected it to 1g. it is port 44 now for both servers. The switch will need to be configured for that block to be 1g @Papaul [16:50:10] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5021.mgmt.eqsin.wmnet with reboot policy FORCED [16:50:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [16:51:24] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5024.mgmt.eqsin.wmnet with reboot policy FORCED [16:51:35] (03PS9) 10Andrew Bogott: Openstack config: move oslo_messaging_rabbit into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861890 [16:51:37] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5025.mgmt.eqsin.wmnet with reboot policy FORCED [16:52:30] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5023.mgmt.eqsin.wmnet with reboot policy FORCED [16:53:01] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5026.mgmt.eqsin.wmnet with reboot policy FORCED [16:53:55] (03PS10) 10Andrew Bogott: Openstack config: move oslo_messaging_rabbit into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861890 [16:56:30] (03PS11) 10Andrew Bogott: Openstack config: move oslo_messaging_rabbit into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861890 [16:58:54] (03PS12) 10Andrew Bogott: Openstack config: move 
oslo_messaging_rabbit into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861890 (https://phabricator.wikimedia.org/T318816) [16:59:18] (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/output/861890/38495/" [puppet] - 10https://gerrit.wikimedia.org/r/861890 (https://phabricator.wikimedia.org/T318816) (owner: 10Andrew Bogott) [17:00:04] jbond and rzl: Your horoscope predicts another unfortunate Puppet request window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221129T1700). [17:00:04] No Gerrit patches in the queue for this window AFAICS. [17:01:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P41760 and previous config saved to /var/cache/conftool/dbconfig/20221129-170131-marostegui.json [17:02:29] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (10Papaul) I will take a look once i have the OS going on db120[4-5] [17:02:54] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5024.mgmt.eqsin.wmnet with reboot policy FORCED [17:03:02] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5025.mgmt.eqsin.wmnet with reboot policy FORCED [17:03:41] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host cp5027.mgmt.eqsin.wmnet with reboot policy FORCED [17:04:06] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs5004.mgmt.eqsin.wmnet with reboot policy FORCED [17:04:32] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5026.mgmt.eqsin.wmnet with reboot policy FORCED [17:04:52] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti5004.mgmt.eqsin.wmnet with reboot policy FORCED [17:06:33] (03CR) 10MVernon: swift: move ms-be2050 to new naming schema (031 comment) [puppet] - 
https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: Jbond) [17:07:31] RECOVERY - Host an-worker1089 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [17:11:17] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs5004.mgmt.eqsin.wmnet with reboot policy FORCED [17:11:56] !log otto@deploy1002 Started deploy [analytics/refinery@c45b61d]: Regular analytics weekly train [analytics/refinery@c45b61d] [17:12:35] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host dns5004.mgmt.eqsin.wmnet with reboot policy FORCED [17:13:17] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [17:13:31] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [17:14:02] (CR) Jbond: swift: move ms-be2050 to new naming schema (1 comment) [puppet] - https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: Jbond) [17:14:10] (CR) Jbond: swift: move ms-be2050 to new naming schema (1 comment) [puppet] - https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: Jbond) [17:14:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1204.eqiad.wmnet with OS bullseye [17:14:42] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1204.eqiad.wmnet with OS bullseye [17:15:43] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cp5027.mgmt.eqsin.wmnet with reboot policy FORCED [17:15:50] !log otto@deploy1002 Finished deploy [analytics/refinery@c45b61d]: Regular analytics weekly train [analytics/refinery@c45b61d] (duration: 03m 54s) [17:15:59] !log otto@deploy1002 Started deploy [analytics/refinery@c45b61d] (thin): Regular analytics 
weekly train THIN [analytics/refinery@c45b61d] [17:16:08] !log otto@deploy1002 Finished deploy [analytics/refinery@c45b61d] (thin): Regular analytics weekly train THIN [analytics/refinery@c45b61d] (duration: 00m 09s) [17:16:09] !log otto@deploy1002 Started deploy [analytics/refinery@c45b61d] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@c45b61d] [17:16:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T321126)', diff saved to https://phabricator.wikimedia.org/P41761 and previous config saved to /var/cache/conftool/dbconfig/20221129-171638-marostegui.json [17:16:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1196.eqiad.wmnet with reason: Maintenance [17:16:46] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [17:17:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1196.eqiad.wmnet with reason: Maintenance [17:17:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T321126)', diff saved to https://phabricator.wikimedia.org/P41762 and previous config saved to /var/cache/conftool/dbconfig/20221129-171710-marostegui.json [17:17:13] !log otto@deploy1002 Finished deploy [analytics/refinery@c45b61d] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@c45b61d] (duration: 01m 03s) [17:18:22] !log otto@deploy1002 Started deploy [analytics/refinery@c45b61d] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@c45b61d] - an-test-coord1001 only [17:18:27] !log otto@deploy1002 Finished deploy [analytics/refinery@c45b61d] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@c45b61d] - an-test-coord1001 only (duration: 00m 04s) [17:18:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T323907)', diff saved to 
https://phabricator.wikimedia.org/P41763 and previous config saved to /var/cache/conftool/dbconfig/20221129-171827-ladsgroup.json [17:18:38] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [17:18:44] 10SRE, 10LDAP-Access-Requests: Grant Access to wmf for abartov - https://phabricator.wikimedia.org/T323911 (10SRamkisson) a:03SRamkisson Approved on my end, [17:19:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T321126)', diff saved to https://phabricator.wikimedia.org/P41764 and previous config saved to /var/cache/conftool/dbconfig/20221129-171931-marostegui.json [17:20:56] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti5004.mgmt.eqsin.wmnet with reboot policy FORCED [17:21:06] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1205.eqiad.wmnet with OS bullseye [17:21:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1205.eqiad.wmnet with OS bullseye [17:21:48] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host ganeti5004.mgmt.eqsin.wmnet with reboot policy FORCED [17:22:18] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dns5004.mgmt.eqsin.wmnet with reboot policy FORCED [17:22:26] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5021'] [17:22:46] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5022'] [17:23:22] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5023'] [17:25:20] (03CR) 10JMeybohm: [C: 03+1] "I've added the justification to https://phabricator.wikimedia.org/T303279 as well" [deployment-charts] - 10https://gerrit.wikimedia.org/r/861399 
(https://phabricator.wikimedia.org/T303279) (owner: 10Elukey) [17:26:52] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1204.eqiad.wmnet with reason: host reimage [17:27:25] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:28:43] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:30:45] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for db1206 - pt1979@cumin2002" [17:31:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for db1206 - pt1979@cumin2002" [17:31:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:31:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1204.eqiad.wmnet with reason: host reimage [17:32:43] (03PS19) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [17:32:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1205.eqiad.wmnet with reason: host reimage [17:33:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P41765 and previous config saved to /var/cache/conftool/dbconfig/20221129-173334-ladsgroup.json [17:34:09] (03PS1) 10Arturo Borrero Gonzalez: openstack: haproxy: introduce hiera config hash [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [17:34:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P41766 and previous config 
saved to /var/cache/conftool/dbconfig/20221129-173438-marostegui.json [17:34:39] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5021'] [17:34:50] (03CR) 10CI reject: [V: 04-1] WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [17:34:51] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5022'] [17:34:55] !log robh@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ganeti5004.mgmt.eqsin.wmnet with reboot policy FORCED [17:35:08] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5023'] [17:35:28] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5024'] [17:36:07] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5025'] [17:36:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1205.eqiad.wmnet with reason: host reimage [17:36:43] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5026'] [17:37:17] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5027'] [17:40:46] ACKNOWLEDGEMENT - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: Cathal Mooney Known issue with carrier. T322529 - The acknowledgement expires at: 2023-01-01 00:00:00. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:41:46] ACKNOWLEDGEMENT - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: Cathal Mooney Known issue with carrier T322529. 
- The acknowledgement expires at: 2023-01-01 00:00:00. https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:41:53] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cp5025'] [17:42:28] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cp5025'] [17:43:20] (CR) Arturo Borrero Gonzalez: "I'm still working on this. Sharing to gerrit as backup in case I lost my devel machine :-P" [puppet] - https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: Arturo Borrero Gonzalez) [17:45:24] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 140 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:45:27] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1204.eqiad.wmnet with OS bullseye [17:45:31] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1204.eqiad.wmnet with OS bullseye completed: - db1204 (**PASS**) - R... 
[17:46:05] (03PS4) 10DDesouza: Deploy Research Incentive survey on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851714 (https://phabricator.wikimedia.org/T321930) [17:47:34] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5024'] [17:48:04] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:48:37] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5026'] [17:48:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P41767 and previous config saved to /var/cache/conftool/dbconfig/20221129-174840-ladsgroup.json [17:49:29] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5027'] [17:49:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P41768 and previous config saved to /var/cache/conftool/dbconfig/20221129-174945-marostegui.json [17:51:18] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [17:51:34] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ganeti5004'] [17:52:00] !log pt1979@cumin2002 END (PASS) - 
Cookbook sre.hosts.reimage (exit_code=0) for host db1205.eqiad.wmnet with OS bullseye [17:52:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db1205.eqiad.wmnet with OS bullseye completed: - db1205 (**PASS**) - R... [17:52:15] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dns5004'] [17:52:48] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [17:53:12] RECOVERY - SSH on mw1331.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [17:54:02] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 206 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:54:16] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cp5025'] [17:56:38] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) [17:57:22] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 195 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops 
[17:58:27] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (Papaul) [17:59:07] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install db1204, db1205 - https://phabricator.wikimedia.org/T313978 (Papaul) Open→Resolved This is complete [17:59:12] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 6 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:02:19] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ganeti5004'] [18:03:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T323907)', diff saved to https://phabricator.wikimedia.org/P41769 and previous config saved to /var/cache/conftool/dbconfig/20221129-180347-ladsgroup.json [18:03:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [18:03:54] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [18:03:58] !log robh@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dns5004'] [18:04:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [18:04:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T323907)', diff saved to https://phabricator.wikimedia.org/P41770 and previous config saved to /var/cache/conftool/dbconfig/20221129-180408-ladsgroup.json [18:04:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T321126)', diff saved to https://phabricator.wikimedia.org/P41771 and previous config saved to 
/var/cache/conftool/dbconfig/20221129-180451-marostegui.json [18:04:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:04:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [18:04:58] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [18:05:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2097.codfw.wmnet with reason: Maintenance [18:05:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2097.codfw.wmnet with reason: Maintenance [18:05:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2102.codfw.wmnet with reason: Maintenance [18:05:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2102.codfw.wmnet with reason: Maintenance [18:06:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2103.codfw.wmnet with reason: Maintenance [18:06:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2103.codfw.wmnet with reason: Maintenance [18:06:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2103 (T321126)', diff saved to https://phabricator.wikimedia.org/P41772 and previous config saved to /var/cache/conftool/dbconfig/20221129-180646-marostegui.json [18:08:37] (03PS1) 10Jbond: idp_test: Add idp-dev service [puppet] - 10https://gerrit.wikimedia.org/r/861906 [18:08:39] (03PS1) 10Jbond: P:idp::standalone: fix minor formating issues [puppet] - 10https://gerrit.wikimedia.org/r/861907 [18:09:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T321126)', diff saved to https://phabricator.wikimedia.org/P41773 and previous config saved to 
/var/cache/conftool/dbconfig/20221129-180909-marostegui.json [18:09:40] (CR) Jbond: [C: +2] idp_test: Add idp-dev service [puppet] - https://gerrit.wikimedia.org/r/861906 (owner: Jbond) [18:09:51] (CR) Jbond: [C: +2] P:idp::standalone: fix minor formating issues [puppet] - https://gerrit.wikimedia.org/r/861907 (owner: Jbond) [18:14:40] ACKNOWLEDGEMENT - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 Cathal Mooney Issue with GTT transport - T324047 - The acknowledgement expires at: 2022-12-01 18:13:45. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:15:00] ACKNOWLEDGEMENT - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP Cathal Mooney Issue with GTT transport - T324047 - The acknowledgement expires at: 2022-12-01 18:13:56. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:15:57] (PS1) Jbond: idp cloud: switch to id-test [puppet] - https://gerrit.wikimedia.org/r/861908 [18:16:20] (CR) Jbond: [V: +2 C: +2] idp cloud: switch to id-test [puppet] - https://gerrit.wikimedia.org/r/861908 (owner: Jbond) [18:17:28] jouncebot nowandnext [18:17:29] No deployments scheduled for the next 0 hour(s) and 42 minute(s) [18:17:29] In 0 hour(s) and 42 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221129T1900) [18:21:49] (PS1) Ssingh: cp5021: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/861909 (https://phabricator.wikimedia.org/T322048) [18:21:51] (PS1) Ssingh: cp5022: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/861910 (https://phabricator.wikimedia.org/T322048) [18:21:53] (PS1) Ssingh: cp5023: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/861911 (https://phabricator.wikimedia.org/T322048) [18:21:55] ACKNOWLEDGEMENT - OSPF status on cr2-drmrs is CRITICAL: OSPFv2: 2/4 
UP : OSPFv3: 2/4 UP Cathal Mooney Known issue with GTT transport T324047 - The acknowledgement expires at: 2022-12-01 18:21:20. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:21:55] (PS1) Ssingh: cp5024: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/861912 (https://phabricator.wikimedia.org/T322048) [18:21:57] (PS1) Ssingh: cp5025: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/861913 (https://phabricator.wikimedia.org/T322048) [18:21:59] (PS1) Ssingh: cp5026: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/861914 (https://phabricator.wikimedia.org/T322048) [18:22:01] (PS1) Ssingh: cp5027: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/861915 (https://phabricator.wikimedia.org/T322048) [18:22:40] ACKNOWLEDGEMENT - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP Cathal Mooney Known issue with GTT transport T324047 - The acknowledgement expires at: 2022-12-01 18:22:22. https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:22:54] ACKNOWLEDGEMENT - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 2 Cathal Mooney Known issue with GTT transport T324047 - The acknowledgement expires at: 2022-12-01 18:22:41. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:23:59] (PS1) Jbond: idp01: use idp-test not dev [puppet] - https://gerrit.wikimedia.org/r/861916 [18:24:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P41774 and previous config saved to /var/cache/conftool/dbconfig/20221129-182416-marostegui.json [18:24:54] (CR) Jbond: [C: +2] idp01: use idp-test not dev [puppet] - https://gerrit.wikimedia.org/r/861916 (owner: Jbond) [18:25:06] (CR) Jbond: [V: +2 C: +2] idp01: use idp-test not dev [puppet] - https://gerrit.wikimedia.org/r/861916 (owner: Jbond) [18:27:48] (CR) Ssingh: [C: +2] cp5021: update site.pp and related configs for cp role [puppet] - https://gerrit.wikimedia.org/r/861909 (https://phabricator.wikimedia.org/T322048) (owner: Ssingh) [18:28:56] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5021.eqsin.wmnet with OS buster [18:29:05] SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5021.eqsin.wmnet with OS buster [18:32:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [18:33:54] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:36:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [18:36:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [18:36:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [18:37:03] (ProbeDown) firing: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - 
https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:39:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P41775 and previous config saved to /var/cache/conftool/dbconfig/20221129-183922-marostegui.json [18:40:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T323907)', diff saved to https://phabricator.wikimedia.org/P41776 and previous config saved to /var/cache/conftool/dbconfig/20221129-184047-ladsgroup.json [18:40:54] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [18:41:50] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [18:42:03] (ProbeDown) resolved: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:42:20] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp5021.eqsin.wmnet with OS buster [18:42:28] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5021.eqsin.wmnet with OS buster executed with errors: - cp5021 (**... 
[18:43:07] !log sukhe@cumin2002:~$ sudo ipmitool -I lanplus -H "cp5021.mgmt.eqsin.wmnet" -U root -E chassis power cycle [18:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:56] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 8518126600 and 1492 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:46:08] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [18:53:38] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 23276262424 and 1954 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [18:54:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T321126)', diff saved to https://phabricator.wikimedia.org/P41777 and previous config saved to /var/cache/conftool/dbconfig/20221129-185429-marostegui.json [18:54:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2116.codfw.wmnet with reason: Maintenance [18:54:32] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [18:54:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2116.codfw.wmnet with reason: Maintenance [18:54:50] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [18:54:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T321126)', diff saved to https://phabricator.wikimedia.org/P41778 and previous config saved to /var/cache/conftool/dbconfig/20221129-185450-marostegui.json [18:55:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to 
https://phabricator.wikimedia.org/P41779 and previous config saved to /var/cache/conftool/dbconfig/20221129-185553-ladsgroup.json [18:56:20] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [18:57:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T321126)', diff saved to https://phabricator.wikimedia.org/P41780 and previous config saved to /var/cache/conftool/dbconfig/20221129-185714-marostegui.json [18:57:15] (03PS2) 10Bartosz Dziewoński: Use new DiscussionTools heading markup on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856551 (https://phabricator.wikimedia.org/T314714) [18:58:08] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18720272 and 2224 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:00:05] dancy and brennen: gettimeofday() says it's time for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221129T1900) [19:00:54] The train is currently blocked on a security patch issue as well as a blocker task https://phabricator.wikimedia.org/T324028 [19:01:07] (03PS1) 10Bartosz Dziewoński: Only match article path until first '?' 
when parsing links [extensions/DiscussionTools] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861475 (https://phabricator.wikimedia.org/T324028) [19:03:26] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 8298212840 and 2541 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:06:44] (03CR) 10Andrea Denisse: [C: 03+2] netmon: Add netmon2002 to the alertmanager rw api [puppet] - 10https://gerrit.wikimedia.org/r/854974 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [19:07:48] (03PS20) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [19:10:27] (03CR) 10CI reject: [V: 04-1] WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [19:11:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P41781 and previous config saved to /var/cache/conftool/dbconfig/20221129-191100-ladsgroup.json [19:11:32] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 200 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:12:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P41782 and previous config saved to /var/cache/conftool/dbconfig/20221129-191220-marostegui.json [19:12:48] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:12:50] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 224577040 and 3106 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:14:18] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 967408 and 3193 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:16:38] (03PS1) 10Majavah: exim: Disable IPv6 on mail hosts on cloud vms [puppet] - 10https://gerrit.wikimedia.org/r/861924 (https://phabricator.wikimedia.org/T324051) [19:17:27] (03PS1) 10Eevans: echostore: bump container version to v1.0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861925 (https://phabricator.wikimedia.org/T253244) [19:17:43] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38498/console" [puppet] - 10https://gerrit.wikimedia.org/r/861924 (https://phabricator.wikimedia.org/T324051) (owner: 10Majavah) [19:20:21] (03PS1) 10MusikAnimal: Enable Phonos on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861926 (https://phabricator.wikimedia.org/T321084) [19:21:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:25:06] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 630000 and 3841 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:26:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T323907)', 
diff saved to https://phabricator.wikimedia.org/P41783 and previous config saved to /var/cache/conftool/dbconfig/20221129-192606-ladsgroup.json [19:26:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance [19:26:14] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [19:26:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance [19:26:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T323907)', diff saved to https://phabricator.wikimedia.org/P41784 and previous config saved to /var/cache/conftool/dbconfig/20221129-192628-ladsgroup.json [19:26:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH events) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:27:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P41785 and previous config saved to /var/cache/conftool/dbconfig/20221129-192728-marostegui.json [19:39:06] (03PS1) 10Jbond: apereo_cas: fix delegated authentication config [puppet] - 10https://gerrit.wikimedia.org/r/861929 [19:40:06] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38499/console" [puppet] - 10https://gerrit.wikimedia.org/r/861929 (owner: 10Jbond) [19:41:16] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (10Papaul) @Jclark-ctr netbox is showing that the server is racked in B8 or on the task it says that the server is in rack B1 (db1206 B1 U36 Port 26 ) can you please double check. Thanks. 
[19:42:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T321126)', diff saved to https://phabricator.wikimedia.org/P41786 and previous config saved to /var/cache/conftool/dbconfig/20221129-194235-marostegui.json [19:42:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2130.codfw.wmnet with reason: Maintenance [19:42:43] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [19:42:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2130.codfw.wmnet with reason: Maintenance [19:42:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T321126)', diff saved to https://phabricator.wikimedia.org/P41787 and previous config saved to /var/cache/conftool/dbconfig/20221129-194257-marostegui.json [19:43:59] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (10Papaul) a:05Cmjohnson→03Papaul [19:45:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T321126)', diff saved to https://phabricator.wikimedia.org/P41788 and previous config saved to /var/cache/conftool/dbconfig/20221129-194520-marostegui.json [19:45:22] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 5058 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:00:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P41789 and previous config saved to /var/cache/conftool/dbconfig/20221129-200027-marostegui.json [20:02:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T323907)', diff saved to https://phabricator.wikimedia.org/P41790 and previous config saved to 
/var/cache/conftool/dbconfig/20221129-200233-ladsgroup.json [20:02:40] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [20:10:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:15:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P41791 and previous config saved to /var/cache/conftool/dbconfig/20221129-201533-marostegui.json [20:15:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:17:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P41792 and previous config saved to /var/cache/conftool/dbconfig/20221129-201739-ladsgroup.json [20:20:20] (03CR) 10Jbond: [V: 03+1 C: 03+2] apereo_cas: fix delegated authentication config [puppet] - 10https://gerrit.wikimedia.org/r/861929 (owner: 10Jbond) [20:30:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T321126)', diff saved to https://phabricator.wikimedia.org/P41793 and previous config saved to /var/cache/conftool/dbconfig/20221129-203040-marostegui.json [20:30:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2141.codfw.wmnet with reason: Maintenance [20:30:48] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [20:30:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 
db2141.codfw.wmnet with reason: Maintenance [20:31:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2145.codfw.wmnet with reason: Maintenance [20:31:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2145.codfw.wmnet with reason: Maintenance [20:31:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T321126)', diff saved to https://phabricator.wikimedia.org/P41794 and previous config saved to /var/cache/conftool/dbconfig/20221129-203135-marostegui.json [20:32:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P41795 and previous config saved to /var/cache/conftool/dbconfig/20221129-203246-ladsgroup.json [20:33:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T321126)', diff saved to https://phabricator.wikimedia.org/P41796 and previous config saved to /var/cache/conftool/dbconfig/20221129-203359-marostegui.json [20:35:05] (03PS2) 10Jbond: apero_cas: (WIP) add addtional paramas for OIDC [puppet] - 10https://gerrit.wikimedia.org/r/858362 (https://phabricator.wikimedia.org/T311999) [20:36:55] 10SRE, 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (10nshahquinn-wmf) [20:37:29] 10SRE, 10Data-Engineering, 10Product-Analytics, 10Wmfdata-Python: wmfdata.mariadb relies on analytics-mysql being available - https://phabricator.wikimedia.org/T292479 (10nshahquinn-wmf) Updated the description to note: > In addition, analytics-mysql is not available on an-test-client1001, which complicate... 
[20:42:05] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38500/console" [puppet] - 10https://gerrit.wikimedia.org/r/858362 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond) [20:43:38] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 2 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10XenoRyet) [20:43:55] 10SRE, 10Fundraising-Backlog, 10Traffic-Icebox, 10fr-donorservices, and 2 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10greg) Hey ya'll, just making sure there's nothing for us on FR-Tech to do right now. I assume it's still in "Let the Email team and Traffic... [20:47:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T323907)', diff saved to https://phabricator.wikimedia.org/P41797 and previous config saved to /var/cache/conftool/dbconfig/20221129-204752-ladsgroup.json [20:48:00] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [20:49:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P41798 and previous config saved to /var/cache/conftool/dbconfig/20221129-204905-marostegui.json [20:52:08] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:54:48] PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:56:06] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 3 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:56:06] (03PS21) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [20:58:17] (03CR) 10CI reject: [V: 04-1] WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [21:00:04] RoanKattouw, Urbanecm, cjming, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221129T2100). [21:00:04] wugapodes, ebernhardson, danisztls, and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:13] i can deploy today! [21:00:18] o/ [21:00:25] here [21:01:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851714 (https://phabricator.wikimedia.org/T321930) (owner: 10DDesouza) [21:01:30] \o [21:01:35] MatmaRex: hi, around? [21:01:55] (03PS22) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [21:02:13] (03Merged) 10jenkins-bot: Deploy Research Incentive survey on frwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/851714 (https://phabricator.wikimedia.org/T321930) (owner: 10DDesouza) [21:02:46] hi wugapodes, ad your patch, where are the messages defined please? Since it's enwiki-only contact page, I'd expect them on wiki (ie. at https://en.wikipedia.org/wiki/MediaWiki:Contactpage-arbcom-block-appeal-prior), but I don't see anything there. 
[21:02:55] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:851714|Deploy Research Incentive survey on frwiki (T321930)]] [21:02:56] hi [21:03:03] T321930: Deploy Research Incentive Survey targeting Sub-Saharan Africa on French Wikipedia - https://phabricator.wikimedia.org/T321930 [21:03:06] 10SRE, 10SRE-Access-Requests: Turnilo access request for User:wfan - https://phabricator.wikimedia.org/T324057 (10AnnWF) [21:03:24] hi, +2'ing the backport [21:03:30] urbanecm: they'll be defined on-wiki, I wasn't sure if I should add them before or after deployment, I can add them now [21:03:38] (03CR) 10Urbanecm: [C: 03+2] Only match article path until first '?' when parsing links [extensions/DiscussionTools] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861475 (https://phabricator.wikimedia.org/T324028) (owner: 10Bartosz Dziewoński) [21:03:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:04:05] (03CR) 10CI reject: [V: 04-1] WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [21:04:12] wugapodes: it's better to have them before deployment, so it can be tested if everything works right during deployment. if you can add them now, that'd be great. ping me once done, I'll proceed with your patch then. 
[21:04:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P41799 and previous config saved to /var/cache/conftool/dbconfig/20221129-210412-marostegui.json [21:04:22] ack [21:04:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:04:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:05:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:05:56] !log urbanecm@deploy1002 urbanecm and dani: Backport for [[gerrit:851714|Deploy Research Incentive survey on frwiki (T321930)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [21:06:00] danisztls: your patch is now at mwdebug1001, can you test it there please? [21:06:40] urbanecm: yes [21:06:53] great, let me know how it looks like :) [21:07:19] (03PS3) 10Urbanecm: Use new DiscussionTools heading markup on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856551 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński) [21:07:31] (03CR) 10Urbanecm: [C: 03+2] Use new DiscussionTools heading markup on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856551 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński) [21:08:16] (03Merged) 10jenkins-bot: Use new DiscussionTools heading markup on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856551 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński) [21:08:30] (03Merged) 10jenkins-bot: Only match article path until first '?' when parsing links [extensions/DiscussionTools] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861475 (https://phabricator.wikimedia.org/T324028) (owner: 10Bartosz Dziewoński) [21:08:48] urbanecm: survey looks good but it's bellow the infobox for some reason. 
non ideal but not a blocker [21:08:56] danisztls: so, ok to sync? [21:09:03] urbanecm: yes [21:09:05] okay, syncing [21:09:46] urbanecm: thanks! [21:09:49] np [21:09:58] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:10:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:10:38] and...the scap error is back [21:10:44] this is transcript https://www.irccloud.com/pastebin/W6n6EZNG/ [21:11:37] (03PS23) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [21:11:56] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:11:58] urbanecm: with that specific host only? 
[21:12:18] robh: sukhe: install5001 is already telling us the issue [21:12:25] taavi: that, and deploy2002.codfw.wmnet [21:12:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:12:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:12:48] updated the https://phabricator.wikimedia.org/T324023#8430131 paste [21:12:50] urbanecm: the interesting host is in 'ran as mwdeploy@mw1498.eqiad.wmnet' [21:12:57] aha [21:13:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:13:34] mw1498, mw1491, mw1496, mw1493, mw1492 [21:13:36] those one so far [21:14:27] mw1497 now [21:14:45] yeah, so not a single host or rack [21:15:29] unfortunately not, more errors popped then after a while :/ [21:15:36] updated phab comment yet again [21:16:24] taavi: unfortunately i don't fully understand what happened: do you know what the impact of this error is? is it "affected mw hosts were not deployed to"? [21:17:14] urbanecm: yes. 
mw1489-1498 are not getting the code updates [21:17:19] :/ [21:17:24] 10SRE, 10SRE-Access-Requests: Turnilo access request for User:Damilare Adedoyin - https://phabricator.wikimedia.org/T324058 (10Damilare) [21:18:11] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:851714|Deploy Research Incentive survey on frwiki (T321930)]] (duration: 15m 15s) [21:18:18] T321930: Deploy Research Incentive Survey targeting Sub-Saharan Africa on French Wikipedia - https://phabricator.wikimedia.org/T321930 [21:18:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:18:41] * urbanecm waits for the hosts to be marked as inactive [21:19:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T321126)', diff saved to https://phabricator.wikimedia.org/P41800 and previous config saved to /var/cache/conftool/dbconfig/20221129-211918-marostegui.json [21:19:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2146.codfw.wmnet with reason: Maintenance [21:19:26] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [21:19:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:19:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:19:34] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2146.codfw.wmnet with reason: Maintenance [21:19:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T321126)', diff saved to https://phabricator.wikimedia.org/P41801 and previous config saved to /var/cache/conftool/dbconfig/20221129-211940-marostegui.json [21:20:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:20:48] urbanecm: done creating the messages, you can see them in my recent contribs if you want to review them 
https://en.wikipedia.org/wiki/Special:Contributions/Wugapodes [21:21:40] thanks wugapodes! we're having a technical issue with the deployment tooling; I'll ping you once the contact page can be tested. [21:21:54] ack [21:22:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T321126)', diff saved to https://phabricator.wikimedia.org/P41802 and previous config saved to /var/cache/conftool/dbconfig/20221129-212203-marostegui.json [21:22:16] wugapodes: in the meantime: do you have https://wikitech.wikimedia.org/wiki/WikimediaDebug#Browser_extensions installed please? [21:22:37] yep [21:22:45] okay, great [21:23:10] !log jhathaway@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1489.eqiad.wmnet [21:23:59] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cp5021 [21:24:04] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cp5021 [21:24:27] jhathaway: afaik it needs to be pooled=inactive. i don't think pooled=no removes it from scap (which is what's needed in this case). [21:24:38] yes, you need pooled=inactive for this [21:24:48] roger, thanks... 
[21:25:09] 10SRE, 10SRE-Access-Requests: Turnilo access request for User:wfan - https://phabricator.wikimedia.org/T324057 (10AnnWF) [21:25:25] !log jhathaway@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw1489.eqiad.wmnet [21:29:37] !log jhathaway@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw1489.eqiad.wmnet [21:29:37] !log jhathaway@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw1490.eqiad.wmnet [21:29:38] !log jhathaway@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw1491.eqiad.wmnet [21:29:38] !log jhathaway@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw1492.eqiad.wmnet [21:29:38] !log jhathaway@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw1493.eqiad.wmnet [21:29:39] !log jhathaway@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw1494.eqiad.wmnet [21:29:39] !log jhathaway@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw1495.eqiad.wmnet [21:29:39] !log jhathaway@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw1496.eqiad.wmnet [21:29:40] !log jhathaway@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw1497.eqiad.wmnet [21:29:40] !log jhathaway@puppetmaster1001 conftool action : set/pooled=inactive; selector: name=mw1498.eqiad.wmnet [21:29:50] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/861358 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [21:30:36] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/861359 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [21:31:09] Who has op to change topic? [21:31:11] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[dns] - 10https://gerrit.wikimedia.org/r/861357 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [21:31:26] robh: me, for example! [21:31:37] what should i change? [21:31:56] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861361 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [21:31:57] can you update topic to read netbox maint underway do not edit netbox [21:32:06] we're having to revert a netbox deletion i did ; D [21:32:24] robh: done [21:33:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856551 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński) [21:33:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861475 (https://phabricator.wikimedia.org/T324028) (owner: 10Bartosz Dziewoński) [21:33:42] robh: fwiw the source of "who has op in -operations" is https://github.com/wikimedia/wikimedia-irc-ircservserv-config/blob/master/channels/wikimedia-operations.toml [21:34:22] ...a _different_ scap error, `21:33:10 backport failed: '/srv/mediawiki-staging/php-1.40.0-wmf.12'` [21:34:38] that's due to wmf.12 backport when wmf.12 is not yet deployed yet. sigh. 
[21:34:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856551 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński) [21:34:57] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:856551|Use new DiscussionTools heading markup on plwiki (T314714)]] [21:35:04] T314714: Metadata and buttons should be inserted after a heading, not inside of it - https://phabricator.wikimedia.org/T314714 [21:35:14] i'll sync the backport manually [21:35:41] please hold any netbox edit or run of the decommission/reimage cookbook for the next ~15 minutes, I have to restore a netbox backup (see -dcops for context) [21:35:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:36:03] !log urbanecm@deploy1002 urbanecm and matmarex: Backport for [[gerrit:856551|Use new DiscussionTools heading markup on plwiki (T314714)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [21:36:11] MatmaRex: config patch's at mwdebug1001 now, please test! 
[21:36:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:36:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:36:50] looking [21:37:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P41803 and previous config saved to /var/cache/conftool/dbconfig/20221129-213710-marostegui.json [21:37:41] urbanecm: looks good [21:37:46] great, syncing [21:38:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:39:00] PROBLEM - SSH on mw1330.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:40:02] mw1498.eqiad.wmnet and mw1491.eqiad.wmnet errored again, everything else passed [21:40:33] not sure why. because "scap did not see the conftool change"? cc taavi [21:40:45] hmm [21:40:46] um [21:40:58] just those two? 
[21:41:01] correct [21:41:25] in the sync proxies stage, no issues in sync apaches [21:41:43] (03PS1) 10Majavah: Revert "scap: add proxies in row E and F" [puppet] - 10https://gerrit.wikimedia.org/r/861477 [21:42:00] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:856551|Use new DiscussionTools heading markup on plwiki (T314714)]] (duration: 07m 02s) [21:42:07] T314714: Metadata and buttons should be inserted after a heading, not inside of it - https://phabricator.wikimedia.org/T314714 [21:43:01] (03CR) 10JHathaway: [C: 03+2] Revert "scap: add proxies in row E and F" [puppet] - 10https://gerrit.wikimedia.org/r/861477 (owner: 10Majavah) [21:43:12] !log Netbox emergency restore of backup psql-all-dbs-2022-11-29-20-37.sql.gz to revert a deleted device [21:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:40] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:45:48] trying re-sync, to ensure no other erorrs happen (and to double-ensure everything got synced) [21:45:49] !log urbanecm@deploy1002 backport aborted: (duration: 00m 00s) [21:45:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856551 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński) [21:46:00] Hey all, I can deploy a new scap with the bugfix. [21:46:03] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:856551|Use new DiscussionTools heading markup on plwiki (T314714)]] [21:46:26] I'll start the release process now to get the ball rolling. 
[21:46:32] dancy: well, the bug was already workarounded with removing the impacted servers [21:46:35] ok you can now resume normal netbox-related work, thanks [21:46:36] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_drmrs02_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:42] Apologies for the bugs. [21:46:45] i'd prefer keeping it as-is now, assuming it works [21:46:49] and release after B&C [21:46:58] volans: ok for me to go back to Status: ok? [21:47:01] (in topic) [21:47:03] ok. let me know when you're ready [21:47:06] will do [21:48:03] dancy: also, i filled T324060 about (another) scap bug, if you've a while to check it, would be appreciated. i can workaround that one easily though. [21:48:03] T324060: scap backport: KeyError: '/srv/mediawiki-staging/php-1.40.0-wmf.12' - https://phabricator.wikimedia.org/T324060 [21:48:45] and...no scap errors this time! thanks taavi and jhathaway for the help [21:49:04] sweet! [21:49:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:49:29] urbanecm:no blockers, it was just netbox-related [21:50:18] volans: yeah, rob.h asked me to put "do not edit netbox" into topic, so i'm double checking if that's ok to remove now. i assume it is considering your message. [21:50:29] ok it can come out now yep [21:50:33] volans pushed the fix for me [21:50:35] thanks! [21:50:45] okay then! 
[21:51:05] ah sorry I misread, yes, thanks [21:51:18] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:856551|Use new DiscussionTools heading markup on plwiki (T314714)]] (duration: 05m 15s) [21:51:26] T314714: Metadata and buttons should be inserted after a heading, not inside of it - https://phabricator.wikimedia.org/T314714 [21:51:40] (03PS3) 10Urbanecm: Add ContactPage and ArbCom form to EnWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860946 (https://phabricator.wikimedia.org/T321447) (owner: 10Wugapodes) [21:51:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:51:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:51:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860946 (https://phabricator.wikimedia.org/T321447) (owner: 10Wugapodes) [21:52:04] wugapodes: going ahead with the contact page now! [21:52:11] ty! 
[21:52:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P41804 and previous config saved to /var/cache/conftool/dbconfig/20221129-215216-marostegui.json [21:52:26] PROBLEM - mediawiki-installation DSH group on mw1492 is CRITICAL: Host mw1492 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:52:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:52:28] (03Merged) 10jenkins-bot: Add ContactPage and ArbCom form to EnWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860946 (https://phabricator.wikimedia.org/T321447) (owner: 10Wugapodes) [21:52:42] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:860946|Add ContactPage and ArbCom form to EnWiki (T321447)]] [21:52:48] T321447: Add Extension:ContactPage to EnWiki for Arbitration Committee - https://phabricator.wikimedia.org/T321447 [21:53:43] !log urbanecm@deploy1002 urbanecm and wug: Backport for [[gerrit:860946|Add ContactPage and ArbCom form to EnWiki (T321447)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [21:53:44] (03PS1) 10Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 [21:53:56] wugapodes: can you test your patch at mwdebug1001 now please? 
[21:54:58] PROBLEM - mediawiki-installation DSH group on mw1489 is CRITICAL: Host mw1489 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [21:56:30] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:56:38] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:57:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [21:57:42] (03PS1) 10Urbanecm: noc: Publicly expose EnWikiContactPages.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861941 (https://phabricator.wikimedia.org/T321447) [21:58:38] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [21:58:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [21:58:44] (03PS1) 10Urbanecm: noc: Update symlink to reverse-proxy-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861942 [21:59:35] urbanecm: when submitted, one of the variables doesn't seem to populate ("Your email message has been sent. $1 will also receive a notification about your email unless they have disabled this in their preferences.") but there don't seem to be any functional issues [21:59:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [21:59:39] (03CR) 10Dzahn: [C: 03+2] Enable profile::auto_restarts::service for Envoy on releases* [puppet] - 10https://gerrit.wikimedia.org/r/861846 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [22:00:17] wugapodes: that seems to be an issue with Extension:ContactPage itself. I'll sync it out; can you please file a Phabricator ticket about this, so it can be looked at later?
[22:00:24] (03PS2) 10Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 [22:00:25] will do [22:00:27] thanks [22:00:49] !log UTC late backport window is overrunning a bit [22:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:23] (03CR) 10Urbanecm: [C: 03+2] noc: Publicly expose EnWikiContactPages.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861941 (https://phabricator.wikimedia.org/T321447) (owner: 10Urbanecm) [22:01:43] (03CR) 10Urbanecm: [C: 03+2] noc: Update symlink to reverse-proxy-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861942 (owner: 10Urbanecm) [22:02:19] (03Merged) 10jenkins-bot: noc: Publicly expose EnWikiContactPages.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861941 (https://phabricator.wikimedia.org/T321447) (owner: 10Urbanecm) [22:02:23] !log [releases1002:~] $ sudo systemctl start wmf_auto_restart_envoyproxy.service | test after deploying gerrit:861846 [22:02:26] (03Merged) 10jenkins-bot: noc: Update symlink to reverse-proxy-labs.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861942 (owner: 10Urbanecm) [22:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:30] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:860946|Add ContactPage and ArbCom form to EnWiki (T321447)]] (duration: 11m 47s) [22:04:35] !log [releases2002:~] $ sudo systemctl status wmf_auto_restart_envoyproxy.service [22:04:37] T321447: Add Extension:ContactPage to EnWiki for Arbitration Committee - https://phabricator.wikimedia.org/T321447 [22:04:39] (03PS13) 10Andrew Bogott: Openstack config: move oslo_messaging_rabbit into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861890 (https://phabricator.wikimedia.org/T318816) [22:04:41] (03PS3) 10Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - 
10https://gerrit.wikimedia.org/r/861940 [22:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:46] !log urbanecm@deploy1002 backport aborted: (duration: 00m 00s) [22:04:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861941 (https://phabricator.wikimedia.org/T321447) (owner: 10Urbanecm) [22:04:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861942 (owner: 10Urbanecm) [22:05:01] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:861941|noc: Publicly expose EnWikiContactPages.php (T321447)]], [[gerrit:861942|noc: Update symlink to reverse-proxy-labs.php]] [22:05:06] !log robh@cumin2002 START - Cookbook sre.dns.netbox [22:05:09] (03CR) 10Dzahn: [C: 03+2] "tested. both cases looked fine:" [puppet] - 10https://gerrit.wikimedia.org/r/861846 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [22:05:26] (03PS24) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [22:06:15] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:07:00] (03CR) 10Dzahn: "I think realistically I am not going to have new comments on this one and would respectfully remove myself as reviewer." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/681665 (owner: 10Alexandros Kosiaris) [22:07:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T321126)', diff saved to https://phabricator.wikimedia.org/P41806 and previous config saved to /var/cache/conftool/dbconfig/20221129-220723-marostegui.json [22:07:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2153.codfw.wmnet with reason: Maintenance [22:07:30] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [22:07:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2153.codfw.wmnet with reason: Maintenance [22:07:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2153 (T321126)', diff saved to https://phabricator.wikimedia.org/P41807 and previous config saved to /var/cache/conftool/dbconfig/20221129-220745-marostegui.json [22:08:43] (03PS5) 10Dzahn: thumbor/mwmaint: add periodic job to pull fc-list file [puppet] - 10https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) [22:08:55] (03CR) 10Dzahn: "needed manual rebase" [puppet] - 10https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [22:08:59] (03CR) 10CI reject: [V: 04-1] thumbor/mwmaint: add periodic job to pull fc-list file [puppet] - 10https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [22:09:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [22:10:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T321126)', diff saved to https://phabricator.wikimedia.org/P41808 and previous config saved to /var/cache/conftool/dbconfig/20221129-221008-marostegui.json [22:10:11] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:861941|noc: Publicly 
expose EnWikiContactPages.php (T321447)]], [[gerrit:861942|noc: Update symlink to reverse-proxy-labs.php]] (duration: 05m 10s) [22:10:21] T321447: Add Extension:ContactPage to EnWiki for Arbitration Committee - https://phabricator.wikimedia.org/T321447 [22:10:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [22:10:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [22:10:33] ebernhardson: hi, if you're still around, can you deploy your patch please? [22:10:38] urbanecm: sure [22:10:55] (03PS6) 10Dzahn: thumbor/mwmaint: add periodic job to pull fc-list file [puppet] - 10https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) [22:10:56] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18705381944 and 13792 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:11:00] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18338382392 and 13796 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:11:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [22:11:10] (03CR) 10CI reject: [V: 04-1] thumbor/mwmaint: add periodic job to pull fc-list file [puppet] - 10https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [22:11:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ebernhardson@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861897 (https://phabricator.wikimedia.org/T323687) (owner: 10Ebernhardson) [22:11:27] (03PS2) 10Ebernhardson: cirrus: Enable document size limiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861897 (https://phabricator.wikimedia.org/T323687) [22:11:38] (03CR) 10TrainBranchBot: "Approved by ebernhardson@deploy1002 using scap 
backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861897 (https://phabricator.wikimedia.org/T323687) (owner: 10Ebernhardson) [22:11:48] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 8827124768 and 13843 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:12:22] (03Merged) 10jenkins-bot: cirrus: Enable document size limiting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861897 (https://phabricator.wikimedia.org/T323687) (owner: 10Ebernhardson) [22:12:32] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 13917572584 and 13888 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:12:35] !log ebernhardson@deploy1002 Started scap: Backport for [[gerrit:861897|cirrus: Enable document size limiting (T323687)]] [22:12:42] T323687: Enable the wmf_capped doc size limiter in the mediawiki-config for CirrusSearch - https://phabricator.wikimedia.org/T323687 [22:13:37] !log ebernhardson@deploy1002 ebernhardson and ebernhardson: Backport for [[gerrit:861897|cirrus: Enable document size limiting (T323687)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [22:16:05] (03PS4) 10Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 [22:16:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [22:16:52] (03CR) 10CI reject: [V: 04-1] Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 (owner: 10Andrew Bogott) [22:17:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [22:17:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [22:17:54] 
PROBLEM - mediawiki-installation DSH group on mw1491 is CRITICAL: Host mw1491 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:18:23] (03PS5) 10Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 [22:18:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [22:18:39] !log ebernhardson@deploy1002 Finished scap: Backport for [[gerrit:861897|cirrus: Enable document size limiting (T323687)]] (duration: 06m 03s) [22:18:46] T323687: Enable the wmf_capped doc size limiter in the mediawiki-config for CirrusSearch - https://phabricator.wikimedia.org/T323687 [22:20:18] ebernhardson: was that all? :) [22:20:25] urbanecm: yup, all set [22:20:30] okay, great! [22:20:37] dancy: B&C's all done now, ready for new scap :) [22:23:28] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 16079140936 and 14543 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:25:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P41809 and previous config saved to /var/cache/conftool/dbconfig/20221129-222514-marostegui.json [22:26:15] (03PS25) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [22:28:16] PROBLEM - mediawiki-installation DSH group on mw1495 is CRITICAL: Host mw1495 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:28:40] (03CR) 10CI reject: [V: 04-1] WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [22:30:28] PROBLEM - mediawiki-installation DSH group on mw1490 is CRITICAL: Host mw1490 is not in mediawiki-installation 
dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:31:24] (03PS26) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [22:31:28] (03PS6) 10Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 [22:32:29] (03CR) 10CI reject: [V: 04-1] Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 (owner: 10Andrew Bogott) [22:33:08] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 149 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:34:36] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 236 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:36:59] (03PS7) 10Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 [22:37:20] (03PS1) 10Herron: wip: dynamic slo dashboard panels [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) [22:37:47] (03CR) 10CI reject: [V: 04-1] Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 (owner: 10Andrew Bogott) [22:39:10] PROBLEM - mediawiki-installation DSH group on mw1493 is CRITICAL: Host mw1493 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:39:10] PROBLEM - mediawiki-installation DSH group on mw1494 is CRITICAL: Host mw1494 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:39:10] PROBLEM - mediawiki-installation DSH group on mw1498 is CRITICAL: Host mw1498 is not in mediawiki-installation dsh group 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:39:10] PROBLEM - mediawiki-installation DSH group on mw1496 is CRITICAL: Host mw1496 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:39:10] PROBLEM - mediawiki-installation DSH group on mw1497 is CRITICAL: Host mw1497 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [22:39:15] (03PS8) 10Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 [22:39:48] RECOVERY - SSH on mw1330.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:40:03] (03CR) 10CI reject: [V: 04-1] Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 (owner: 10Andrew Bogott) [22:40:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P41810 and previous config saved to /var/cache/conftool/dbconfig/20221129-224021-marostegui.json [22:42:48] (03PS9) 10Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 [22:42:55] (03PS27) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [22:43:45] (03CR) 10CI reject: [V: 04-1] Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 (owner: 10Andrew Bogott) [22:50:07] (03PS10) 10Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 [22:50:55] (03CR) 10CI reject: [V: 04-1] Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 (owner: 
10Andrew Bogott) [22:52:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [22:52:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance [22:55:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T321126)', diff saved to https://phabricator.wikimedia.org/P41811 and previous config saved to /var/cache/conftool/dbconfig/20221129-225527-marostegui.json [22:55:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2167.codfw.wmnet with reason: Maintenance [22:55:35] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [22:55:42] (03PS11) 10Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 [22:55:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2167.codfw.wmnet with reason: Maintenance [22:55:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T321126)', diff saved to https://phabricator.wikimedia.org/P41812 and previous config saved to /var/cache/conftool/dbconfig/20221129-225549-marostegui.json [22:56:38] RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:56:41] (03CR) 10CI reject: [V: 04-1] Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 (owner: 10Andrew Bogott) [22:58:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T321126)', diff saved to https://phabricator.wikimedia.org/P41813 and previous config saved to /var/cache/conftool/dbconfig/20221129-225814-marostegui.json [23:00:16] !log 
brennen@deploy1002 Installing scap version "4.29.3" for 600 hosts [23:01:44] !log brennen@deploy1002 Installing scap version "4.29.3" for 600 hosts [23:05:50] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 174 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:07:50] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:08:42] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 33937469808 and 2282 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:09:06] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 838392 and 2308 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:09:56] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1456344 and 2358 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:11:50] (03PS28) 10Effie Mouzeli: WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [23:13:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P41814 and previous config saved to /var/cache/conftool/dbconfig/20221129-231320-marostegui.json [23:13:36] (03PS14) 10Andrew Bogott: Openstack config: move oslo_messaging_rabbit into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861890 (https://phabricator.wikimedia.org/T318816) 
[23:13:38] (03PS12) 10Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 [23:14:00] (03CR) 10CI reject: [V: 04-1] WIP:P:mediawiki::mcrouter_wancache Profile refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [23:14:37] (03CR) 10CI reject: [V: 04-1] Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 (owner: 10Andrew Bogott) [23:16:39] (03PS13) 10Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 [23:17:38] (03CR) 10CI reject: [V: 04-1] Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 (owner: 10Andrew Bogott) [23:19:15] (03PS14) 10Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 [23:20:02] (03CR) 10CI reject: [V: 04-1] Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 (owner: 10Andrew Bogott) [23:21:36] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 138397360 and 3057 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:26:22] (03PS29) 10Effie Mouzeli: P:mediawiki::mcrouter_wancache minor refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [23:26:24] (03PS15) 10Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 [23:26:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1105.eqiad.wmnet with reason: Maintenance [23:26:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1105.eqiad.wmnet with reason: Maintenance [23:26:54] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T323907)', diff saved to https://phabricator.wikimedia.org/P41815 and previous config saved to /var/cache/conftool/dbconfig/20221129-232654-ladsgroup.json [23:27:01] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [23:27:42] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 896499088 and 3422 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:27:47] (03CR) 10CI reject: [V: 04-1] Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 (owner: 10Andrew Bogott) [23:28:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P41816 and previous config saved to /var/cache/conftool/dbconfig/20221129-232827-marostegui.json [23:32:26] (03CR) 10Effie Mouzeli: "Sure there are more improvements, but for the time being, this is OK to unblock T258779" [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [23:33:08] (03CR) 10Effie Mouzeli: "PCC OK: https://puppet-compiler.wmflabs.org/output/860102/38517/" [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [23:35:40] (03PS16) 10Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 [23:35:50] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:35:54] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:36:40] (03CR) 10CI reject: [V: 04-1] Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 (owner: 10Andrew Bogott) 
[23:37:44] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 597053064 and 4024 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:38:52] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1824592 and 4093 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:39:04] (03PS2) 10MusikAnimal: Enable Phonos on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861926 (https://phabricator.wikimedia.org/T321084) [23:39:18] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:39:26] (03PS3) 10MusikAnimal: Enable Phonos on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861926 (https://phabricator.wikimedia.org/T321084) [23:39:44] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.564 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:39:44] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1114120 and 4146 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:39:46] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48974 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:40:37] (03CR) 10CI reject: [V: 04-1] Enable Phonos on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861926 (https://phabricator.wikimedia.org/T321084) (owner: 10MusikAnimal) [23:41:08] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 20 Feb 2023 05:31:14 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:42:06] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 622756568 and 4287 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:43:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T321126)', diff saved to https://phabricator.wikimedia.org/P41817 and previous config saved to /var/cache/conftool/dbconfig/20221129-234333-marostegui.json [23:43:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2170.codfw.wmnet with reason: Maintenance [23:43:40] (03CR) 10Andrew Bogott: [C: 03+2] Openstack config: move oslo_messaging_rabbit into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861890 (https://phabricator.wikimedia.org/T318816) (owner: 10Andrew Bogott) [23:43:42] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [23:43:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2170.codfw.wmnet with reason: Maintenance [23:43:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T321126)', diff saved to https://phabricator.wikimedia.org/P41818 and previous config saved to /var/cache/conftool/dbconfig/20221129-234354-marostegui.json [23:45:42] (03PS17) 10Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 [23:45:59] (03PS4) 10MusikAnimal: Enable Phonos on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861926 (https://phabricator.wikimedia.org/T321084) [23:46:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T321126)', diff saved to https://phabricator.wikimedia.org/P41819 and previous config saved to 
/var/cache/conftool/dbconfig/20221129-234619-marostegui.json [23:46:56] (03CR) 10CI reject: [V: 04-1] Enable Phonos on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861926 (https://phabricator.wikimedia.org/T321084) (owner: 10MusikAnimal) [23:47:48] (03PS5) 10MusikAnimal: Enable Phonos on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861926 (https://phabricator.wikimedia.org/T321084) [23:47:50] (03CR) 10CI reject: [V: 04-1] Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 (owner: 10Andrew Bogott) [23:48:08] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 548163368 and 4648 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:52:10] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 790910456 and 4890 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:54:46] (03PS18) 10Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 [23:54:48] (03PS1) 10Andrew Bogott: Openstack: advance a few last pieces from xena to yoga [puppet] - 10https://gerrit.wikimedia.org/r/861951 [23:55:47] (03CR) 10CI reject: [V: 04-1] Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 (owner: 10Andrew Bogott) [23:58:10] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 5252 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:58:24] (03PS19) 10Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940