[00:00:44] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P41820 and previous config saved to /var/cache/conftool/dbconfig/20221130-000125-marostegui.json [00:01:46] (CR) CI reject: [V: -1] Openstack config: move keystone_authtoken into a shared template [puppet] - https://gerrit.wikimedia.org/r/861940 (owner: Andrew Bogott) [00:01:56] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T323907)', diff saved to https://phabricator.wikimedia.org/P41821 and previous config saved to /var/cache/conftool/dbconfig/20221130-000415-ladsgroup.json [00:04:22] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [00:04:40] (PS20) Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - https://gerrit.wikimedia.org/r/861940 [00:06:02] (PS2) Andrew Bogott: Openstack: advance a few last pieces from xena to yoga [puppet] - https://gerrit.wikimedia.org/r/861951 [00:06:04] (PS21) Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - https://gerrit.wikimedia.org/r/861940 [00:06:56] (CR) CI reject: [V: -1] Openstack config: move keystone_authtoken into a shared template [puppet] - https://gerrit.wikimedia.org/r/861940 (owner: Andrew Bogott) [00:08:56] (CR) CI reject: [V: -1] Openstack config: move keystone_authtoken into a shared template [puppet] - https://gerrit.wikimedia.org/r/861940 (owner: Andrew Bogott) [00:14:28] PROBLEM - NFS Share Volume Space 
/srv/tools on labstore1004 is CRITICAL: DISK CRITICAL - free space: /srv/tools 1263824 MB (15% inode=68%): https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23NFS_volume_cleanup https://grafana.wikimedia.org/d/50z0i4XWz/tools-overall-nfs-storage-utilization?orgId=1 [00:16:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P41822 and previous config saved to /var/cache/conftool/dbconfig/20221130-001632-marostegui.json [00:19:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P41823 and previous config saved to /var/cache/conftool/dbconfig/20221130-001921-ladsgroup.json [00:31:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T321126)', diff saved to https://phabricator.wikimedia.org/P41824 and previous config saved to /var/cache/conftool/dbconfig/20221130-003138-marostegui.json [00:31:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2174.codfw.wmnet with reason: Maintenance [00:31:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2174.codfw.wmnet with reason: Maintenance [00:31:46] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [00:31:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T321126)', diff saved to https://phabricator.wikimedia.org/P41825 and previous config saved to /var/cache/conftool/dbconfig/20221130-003149-marostegui.json [00:34:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T321126)', diff saved to https://phabricator.wikimedia.org/P41826 and previous config saved to /var/cache/conftool/dbconfig/20221130-003413-marostegui.json [00:34:28] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P41827 and previous config saved to /var/cache/conftool/dbconfig/20221130-003428-ladsgroup.json [00:37:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:40:55] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5021.eqsin.wmnet with OS buster [00:41:05] SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5021.eqsin.wmnet with OS buster [00:45:48] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:47:28] (PS1) Cwhite: install_server: set eqiad bullseye vms to install bullseye [puppet] - https://gerrit.wikimedia.org/r/861871 (https://phabricator.wikimedia.org/T321410) [00:47:30] (PS1) Cwhite: install_server: set codfw logstash vms to install bullseye [puppet] - https://gerrit.wikimedia.org/r/861872 (https://phabricator.wikimedia.org/T321410) [00:47:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:49:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P41828 and previous config saved to 
/var/cache/conftool/dbconfig/20221130-004920-marostegui.json [00:49:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T323907)', diff saved to https://phabricator.wikimedia.org/P41829 and previous config saved to /var/cache/conftool/dbconfig/20221130-004934-ladsgroup.json [00:49:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1122.eqiad.wmnet with reason: Maintenance [00:49:42] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [00:49:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1122.eqiad.wmnet with reason: Maintenance [00:49:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T323907)', diff saved to https://phabricator.wikimedia.org/P41830 and previous config saved to /var/cache/conftool/dbconfig/20221130-004956-ladsgroup.json [01:04:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P41831 and previous config saved to /var/cache/conftool/dbconfig/20221130-010426-marostegui.json [01:10:58] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5021.eqsin.wmnet with reason: host reimage [01:14:22] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5021.eqsin.wmnet with reason: host reimage [01:14:43] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/861412 (https://phabricator.wikimedia.org/T323913) (owner: Filippo Giunchedi) [01:15:25] (CR) Cwhite: [C: +1] ProductionServices: move to graphite1005 [mediawiki-config] - https://gerrit.wikimedia.org/r/861361 (https://phabricator.wikimedia.org/T318903) (owner: Filippo Giunchedi) [01:15:40] (CR) Cwhite: [C: +1] wmnet: move writes to graphite1005 [dns] - https://gerrit.wikimedia.org/r/861357 (https://phabricator.wikimedia.org/T318903) (owner: 
Filippo Giunchedi) [01:16:00] (CR) Cwhite: [C: +1] stats: failover writes to graphite1005 [puppet] - https://gerrit.wikimedia.org/r/861359 (https://phabricator.wikimedia.org/T318903) (owner: Filippo Giunchedi) [01:16:13] (CR) Cwhite: [C: +1] graphite: move alerts to graphite1005 [puppet] - https://gerrit.wikimedia.org/r/861358 (https://phabricator.wikimedia.org/T318903) (owner: Filippo Giunchedi) [01:16:41] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/860906 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff) [01:19:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T321126)', diff saved to https://phabricator.wikimedia.org/P41832 and previous config saved to /var/cache/conftool/dbconfig/20221130-011933-marostegui.json [01:19:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2176.codfw.wmnet with reason: Maintenance [01:19:41] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [01:19:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2176.codfw.wmnet with reason: Maintenance [01:19:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T321126)', diff saved to https://phabricator.wikimedia.org/P41833 and previous config saved to /var/cache/conftool/dbconfig/20221130-011954-marostegui.json [01:22:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T321126)', diff saved to https://phabricator.wikimedia.org/P41834 and previous config saved to /var/cache/conftool/dbconfig/20221130-012218-marostegui.json [01:27:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T323907)', diff saved to https://phabricator.wikimedia.org/P41835 and previous config saved to /var/cache/conftool/dbconfig/20221130-012723-ladsgroup.json 
[01:27:30] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [01:37:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P41836 and previous config saved to /var/cache/conftool/dbconfig/20221130-013724-marostegui.json [01:37:45] (JobUnavailable) firing: (2) Reduced availability for job workhorse in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P41837 and previous config saved to /var/cache/conftool/dbconfig/20221130-014229-ladsgroup.json [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:48:08] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5021.eqsin.wmnet with OS buster [01:48:16] SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5021.eqsin.wmnet with OS buster completed: - cp5021 (**WARN**) -... 
[01:48:42] SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (ssingh) [01:52:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P41838 and previous config saved to /var/cache/conftool/dbconfig/20221130-015231-marostegui.json [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:57:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P41839 and previous config saved to /var/cache/conftool/dbconfig/20221130-015736-ladsgroup.json [02:07:14] (PS1) Gergő Tisza: growthexperiments: Use min edit limit for user impact refresh [puppet] - https://gerrit.wikimedia.org/r/861964 (https://phabricator.wikimedia.org/T323958) [02:07:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T321126)', diff saved to https://phabricator.wikimedia.org/P41840 and previous config saved to /var/cache/conftool/dbconfig/20221130-020737-marostegui.json [02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:45] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [02:08:48] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 8685461264 and 13090 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:09:24] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 834905128 and 13126 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:12:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T323907)', diff saved to https://phabricator.wikimedia.org/P41841 and previous config saved to /var/cache/conftool/dbconfig/20221130-021242-ladsgroup.json [02:12:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1129.eqiad.wmnet with reason: Maintenance [02:12:50] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [02:12:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1129.eqiad.wmnet with reason: Maintenance [02:13:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T323907)', diff saved to https://phabricator.wikimedia.org/P41842 and previous config saved to /var/cache/conftool/dbconfig/20221130-021304-ladsgroup.json [02:17:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:20:50] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cp5021.eqsin.wmnet with reason: downtimed after reimage (depooled); failed Icinga check, don't want it to alert at night [02:20:54] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cp5021.eqsin.wmnet with reason: downtimed after reimage (depooled); failed Icinga check, don't want it to alert at night [02:22:28] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 
(host:localhost) 18115490808 and 13909 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:22:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:27:22] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_ImageSuggestions_SendNotificationsForUnillustratedWatchedTitles_PT.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:29:10] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 7173554224 and 14312 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:29:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T323907)', diff saved to https://phabricator.wikimedia.org/P41843 and previous config saved to /var/cache/conftool/dbconfig/20221130-022953-ladsgroup.json [02:30:01] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [02:33:20] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 622656 and 68 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:35:06] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 26 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:35:54] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:40:20] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 300 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:44:53] (CR) Andrew Bogott: [C: +2] Openstack: advance a few last pieces from xena to yoga [puppet] - https://gerrit.wikimedia.org/r/861951 (owner: Andrew Bogott) [02:45:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P41844 and previous config saved to /var/cache/conftool/dbconfig/20221130-024500-ladsgroup.json [02:47:43] (CR) Andrew Bogott: "recheck" [puppet] - https://gerrit.wikimedia.org/r/861940 (owner: Andrew Bogott) [02:54:58] (PS22) Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - https://gerrit.wikimedia.org/r/861940 [03:00:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P41845 and previous config saved to /var/cache/conftool/dbconfig/20221130-030006-ladsgroup.json [03:05:24] (CR) Andrew Bogott: [C: -2] "I believe this is still in use for Mediawiki to coordinate sessions between the two labweb servers." [puppet] - https://gerrit.wikimedia.org/r/861807 (https://phabricator.wikimedia.org/T277183) (owner: Majavah) [03:13:10] SRE, Fundraising-Backlog, Traffic-Icebox, fr-donorservices, and 2 others: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (EWilfong_WMF) Correct, @greg. I'm still gathering feedback from Acoustic's support and will report back here. Hoping to get more informati... 
[03:13:21] (PS23) Andrew Bogott: Openstack config: move keystone_authtoken into a shared template [puppet] - https://gerrit.wikimedia.org/r/861940 [03:15:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T323907)', diff saved to https://phabricator.wikimedia.org/P41846 and previous config saved to /var/cache/conftool/dbconfig/20221130-031513-ladsgroup.json [03:15:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance [03:15:21] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [03:15:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1139.eqiad.wmnet with reason: Maintenance [03:47:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance [03:47:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1146.eqiad.wmnet with reason: Maintenance [03:47:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T323907)', diff saved to https://phabricator.wikimedia.org/P41847 and previous config saved to /var/cache/conftool/dbconfig/20221130-034737-ladsgroup.json [03:47:45] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [04:57:16] (PS1) KartikMistry: ContentTranslation: Disable machine translation for Japanese WP [mediawiki-config] - https://gerrit.wikimedia.org/r/861968 (https://phabricator.wikimedia.org/T323973) [05:04:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T323907)', diff saved to https://phabricator.wikimedia.org/P41848 and previous config saved to /var/cache/conftool/dbconfig/20221130-050412-ladsgroup.json [05:04:20] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [05:06:56] PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: 
CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:07:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:12:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:19:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P41849 and previous config saved to /var/cache/conftool/dbconfig/20221130-051918-ladsgroup.json [05:32:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:34:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P41850 and previous config saved to /var/cache/conftool/dbconfig/20221130-053425-ladsgroup.json [05:42:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:49:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T323907)', diff saved to https://phabricator.wikimedia.org/P41851 and previous config saved to 
/var/cache/conftool/dbconfig/20221130-054931-ladsgroup.json [05:49:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance [05:49:40] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [05:49:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance [05:49:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [05:50:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [05:50:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T323907)', diff saved to https://phabricator.wikimedia.org/P41852 and previous config saved to /var/cache/conftool/dbconfig/20221130-055010-ladsgroup.json [05:50:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:55:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:01:44] RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:31:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T323907)', diff saved to https://phabricator.wikimedia.org/P41853 and previous config saved to 
/var/cache/conftool/dbconfig/20221130-063142-ladsgroup.json [06:31:50] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [06:34:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2112.codfw.wmnet with reason: Maintenance [06:34:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2112.codfw.wmnet with reason: Maintenance [06:35:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1163.eqiad.wmnet with reason: Maintenance [06:35:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1163.eqiad.wmnet with reason: Maintenance [06:37:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:41:00] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P41854 and previous config saved to /var/cache/conftool/dbconfig/20221130-064649-ladsgroup.json [06:47:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [06:51:00] (PS1) JMeybohm: helm-state-metrics: Update to v0.2.0 [docker-images/production-images] - https://gerrit.wikimedia.org/r/862173 (https://phabricator.wikimedia.org/T323706) [06:52:37] (PS1) JMeybohm: helm-state-metrics: Update to v0.2.0 [deployment-charts] - 
https://gerrit.wikimedia.org/r/862174 (https://phabricator.wikimedia.org/T323706) [06:58:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:00:31] RECOVERY - mediawiki-installation DSH group on mw1492 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:01:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P41855 and previous config saved to /var/cache/conftool/dbconfig/20221130-070155-ladsgroup.json [07:03:01] RECOVERY - mediawiki-installation DSH group on mw1489 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:07:31] (PS1) Marostegui: data.yaml: Add Asaf Bartov to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/862175 (https://phabricator.wikimedia.org/T323911) [07:08:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:10:01] SRE, SRE-Access-Requests: Turnilo access request for User:wfan - https://phabricator.wikimedia.org/T324057 (Marostegui) Hello @AnnWF can you please follow the proper template at: https://phabricator.wikimedia.org/maniphest/task/edit/form/8/ [07:10:15] SRE, SRE-Access-Requests: Turnilo access request for User:Damilare Adedoyin - https://phabricator.wikimedia.org/T324058 (Marostegui) Hello @Damilare can you please follow the proper template at: https://phabricator.wikimedia.org/maniphest/task/edit/form/8/ [07:10:23] SRE, SRE-Access-Requests: Turnilo access request for User:Damilare Adedoyin - 
https://phabricator.wikimedia.org/T324058 (Marostegui) p:Triage→Medium [07:10:27] SRE, SRE-Access-Requests: Turnilo access request for User:wfan - https://phabricator.wikimedia.org/T324057 (Marostegui) p:Triage→Medium [07:12:32] (CR) Giuseppe Lavagetto: [C: +2] eventstreams: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860512 (owner: Giuseppe Lavagetto) [07:17:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T323907)', diff saved to https://phabricator.wikimedia.org/P41856 and previous config saved to /var/cache/conftool/dbconfig/20221130-071702-ladsgroup.json [07:17:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [07:17:11] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [07:17:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1170.eqiad.wmnet with reason: Maintenance [07:17:21] (Merged) jenkins-bot: eventstreams: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860512 (owner: Giuseppe Lavagetto) [07:17:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T323907)', diff saved to https://phabricator.wikimedia.org/P41857 and previous config saved to /var/cache/conftool/dbconfig/20221130-071723-ladsgroup.json [07:24:39] (PS4) Slyngshede: Allow multiple server connections to be defined. [software/bitu-ldap] - https://gerrit.wikimedia.org/r/860857 [07:25:17] RECOVERY - mediawiki-installation DSH group on mw1491 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:25:29] (CR) Slyngshede: Allow multiple server connections to be defined. 
(1 comment) [software/bitu-ldap] - https://gerrit.wikimedia.org/r/860857 (owner: Slyngshede) [07:35:50] RECOVERY - mediawiki-installation DSH group on mw1495 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:35:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:38:14] RECOVERY - mediawiki-installation DSH group on mw1490 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:40:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:42:22] (CR) Ayounsi: [C: +1] "Based on the Juniper article it makes sens to remove that filter and the change lgtm!" 
[homer/public] - https://gerrit.wikimedia.org/r/861896 (https://phabricator.wikimedia.org/T324033) (owner: Cathal Mooney) [07:43:25] (CR) Giuseppe Lavagetto: [C: +2] push-notifications: convert to modules [deployment-charts] - https://gerrit.wikimedia.org/r/860704 (owner: Giuseppe Lavagetto) [07:47:22] RECOVERY - mediawiki-installation DSH group on mw1494 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:47:22] RECOVERY - mediawiki-installation DSH group on mw1496 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:47:22] RECOVERY - mediawiki-installation DSH group on mw1493 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:47:22] RECOVERY - mediawiki-installation DSH group on mw1497 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:47:22] RECOVERY - mediawiki-installation DSH group on mw1498 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [07:47:22] (CR) Gergő Tisza: [C: -1] "Name needs update." [mediawiki-config] - https://gerrit.wikimedia.org/r/859545 (https://phabricator.wikimedia.org/T323526) (owner: Kosta Harlan) [07:47:35] (PS1) Ryan Kemper: [WIP] add grizzly dashboard for WDQS uptime [grafana-grizzly] - https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) [07:47:54] (CR) Gergő Tisza: [C: -1] "Or rather, this patch isn't needed anymore." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/859545 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan) [07:48:38] (03Merged) 10jenkins-bot: push-notifications: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860704 (owner: 10Giuseppe Lavagetto) [07:54:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T323907)', diff saved to https://phabricator.wikimedia.org/P41858 and previous config saved to /var/cache/conftool/dbconfig/20221130-075454-ladsgroup.json [07:55:02] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [07:56:39] (03PS2) 10Ryan Kemper: [WIP] add grizzly dashboard for WDQS uptime [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/862178 (https://phabricator.wikimedia.org/T323064) [07:59:29] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (10ayounsi) @BBlack it has been a year now, can we remove the "temporary" static rules now from the routers? I'd like to keep our config lean and I worry this gets forgotten. [08:00:05] Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221130T0800). [08:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:19] * kart_ is here. 
[08:02:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861968 (https://phabricator.wikimedia.org/T323973) (owner: 10KartikMistry) [08:03:01] (03Merged) 10jenkins-bot: ContentTranslation: Disable machine translation for Japanese WP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861968 (https://phabricator.wikimedia.org/T323973) (owner: 10KartikMistry) [08:03:18] !log kartik@deploy1002 Started scap: Backport for [[gerrit:861968|ContentTranslation: Disable machine translation for Japanese WP (T323973)]] [08:03:26] T323973: Disable machine translation for Japanese - https://phabricator.wikimedia.org/T323973 [08:04:29] !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:861968|ContentTranslation: Disable machine translation for Japanese WP (T323973)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [08:06:19] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply [08:06:22] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [08:07:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:07:50] !log dry-running GrowthExperiments refreshUserImpactData.php - T323958#8430768 [08:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:56] T323958: Evaluate AQS use of GrowthExperiments new impact dashboard - https://phabricator.wikimedia.org/T323958 [08:08:26] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:08:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:09:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:10:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1170:3312', diff saved to https://phabricator.wikimedia.org/P41859 and previous config saved to /var/cache/conftool/dbconfig/20221130-081000-ladsgroup.json [08:11:50] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:861968|ContentTranslation: Disable machine translation for Japanese WP (T323973)]] (duration: 08m 31s) [08:11:57] T323973: Disable machine translation for Japanese - https://phabricator.wikimedia.org/T323973 [08:12:15] hi [08:12:54] Amir1 / urbanecm I'd like to backport https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/861963. kart_ could you let me know when you're done, please? Or do you need someone to deploy? [08:13:19] hi kostajh! [08:13:28] kart_: are you done with your backport please? [08:13:41] scap logged "finished", so i think "yes", but... [08:14:19] (03PS1) 10Kosta Harlan: User impact: Make the URL opt-in override the config flag [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861480 (https://phabricator.wikimedia.org/T323526) [08:14:59] urbanecm: Done. [08:15:27] Please go ahead. [08:16:00] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: ICMPv6 'TTL Exceeded' messages are not generated by row E/F switches due to loopback filter - https://phabricator.wikimedia.org/T324033 (10ayounsi) Thanks! Patch reviewed, it's always great to remove unnecessary config! :) For the record a... 
[08:18:15] (03PS3) 10Kosta Harlan: GrowthExperiments: Remove unused GEAllowAccessToNewImpactModule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859545 (https://phabricator.wikimedia.org/T323526) [08:18:32] (03PS4) 10Kosta Harlan: GrowthExperiments: Remove unused GEAllowAccessToNewImpactModule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859545 (https://phabricator.wikimedia.org/T323526) [08:18:44] (03CR) 10Kosta Harlan: GrowthExperiments: Remove unused GEAllowAccessToNewImpactModule (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859545 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan) [08:18:51] kart_: thanks [08:19:14] urbanecm: do you want to backport or should I? (I don't mind doing it.) [08:21:44] 10SRE, 10Data-Engineering, 10Kubernetes: eventstreams cannot be deployed and its deployments will need to be destroyed and recreated - https://phabricator.wikimedia.org/T324074 (10Joe) p:05Triage→03Unbreak! [08:23:24] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/push-notifications: apply [08:23:39] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [08:23:49] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/push-notifications: apply [08:24:14] going ahead with it [08:24:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kharlan@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861480 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan) [08:25:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P41860 and previous config saved to /var/cache/conftool/dbconfig/20221130-082507-ladsgroup.json [08:25:19] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [08:28:01] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/860857 (owner: 10Slyngshede) [08:28:44] (03PS2) 10Muehlenhoff: elasticsearch: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/860906 (https://phabricator.wikimedia.org/T308013) [08:29:11] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [08:29:50] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply [08:31:17] (03CR) 10Marostegui: [C: 03+1] mediawiki: Add quarterly cleanup of flaggedtemplates table [puppet] - 10https://gerrit.wikimedia.org/r/861507 (https://phabricator.wikimedia.org/T290769) (owner: 10Ladsgroup) [08:31:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:32:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] toolhub: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860709 (owner: 10Giuseppe Lavagetto) [08:32:34] (03CR) 10Giuseppe Lavagetto: [C: 03+2] shellbox: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860705 (owner: 10Giuseppe Lavagetto) [08:32:51] (03CR) 10Muehlenhoff: [C: 03+2] elasticsearch: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/860906 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [08:35:49] (03CR) 10Gergő Tisza: [C: 03+1] GrowthExperiments: Remove unused GEAllowAccessToNewImpactModule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859545 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan) [08:36:45] (03Merged) 10jenkins-bot: toolhub: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860709 (owner: 10Giuseppe Lavagetto) [08:36:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - 
https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:37:33] (03Merged) 10jenkins-bot: shellbox: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860705 (owner: 10Giuseppe Lavagetto) [08:40:13] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for dragonfly-supernode [puppet] - 10https://gerrit.wikimedia.org/r/861888 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:40:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T323907)', diff saved to https://phabricator.wikimedia.org/P41861 and previous config saved to /var/cache/conftool/dbconfig/20221130-084013-ladsgroup.json [08:40:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance [08:40:22] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [08:40:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance [08:40:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T323907)', diff saved to https://phabricator.wikimedia.org/P41862 and previous config saved to /var/cache/conftool/dbconfig/20221130-084034-ladsgroup.json [08:42:18] (03Merged) 10jenkins-bot: User impact: Make the URL opt-in override the config flag [extensions/GrowthExperiments] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861480 (https://phabricator.wikimedia.org/T323526) (owner: 10Kosta Harlan) [08:42:58] urbanecm: oh no, the scap application failed [08:42:58] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for prometheus-ipmi-exporter [puppet] - 10https://gerrit.wikimedia.org/r/860569 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:43:10] kostajh: how did 
it fail this time please? [08:43:16] just a sec [08:43:19] (ideally paste a complete traceback) [08:43:33] we're having issues since yesterday, unfortunately :-/ [08:43:54] urbanecm: https://phabricator.wikimedia.org/P41863 [08:44:02] I guess it is because wmf.12 is not yet deployed [08:44:47] oh, yeah [08:44:52] kostajh: you need to deploy this one manually [08:45:08] ie. git fetch, git rebase, scap sync-world 'message' [08:45:25] this is phabricatorized as T324060 [08:45:25] T324060: scap backport: KeyError: '/srv/mediawiki-staging/php-1.40.0-wmf.12' - https://phabricator.wikimedia.org/T324060 [08:47:05] ok [08:48:08] (03CR) 10Elukey: [C: 03+2] ml-services: bump docker image for draftquality (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/800732 (https://phabricator.wikimedia.org/T309102) (owner: 10Elukey) [08:48:46] urbanecm: is there a page with the exact commands? (or do you have time to do it, as I feel less comfortable with doing this part manually) [08:48:56] kostajh: sure, i can do it for you [08:49:03] https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers perhaps [08:49:13] urbanecm: ok, thank you [08:49:16] (03CR) 10Cathal Mooney: [C: 03+2] Remove VRF-specific loopback filter from row E/F switches [homer/public] - 10https://gerrit.wikimedia.org/r/861896 (https://phabricator.wikimedia.org/T324033) (owner: 10Cathal Mooney) [08:49:31] https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers is the docs though [08:49:48] (03Merged) 10jenkins-bot: Remove VRF-specific loopback filter from row E/F switches [homer/public] - 10https://gerrit.wikimedia.org/r/861896 (https://phabricator.wikimedia.org/T324033) (owner: 10Cathal Mooney) [08:49:55] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:50:31] !log urbanecm@deploy1002 Started scap: 5b2f3ae4db0e2d88c90ef00d48f0673dc9ff83b5: User impact: Make the URL opt-in override the config flag (T323526) [08:50:36] syncing [08:50:37] T323526: 
New Impact Module: Start experiment for the new Impact module on Growth Pilot wikis (ar, bn, cs, es) - https://phabricator.wikimedia.org/T323526 [08:50:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:50:43] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:51:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:51:45] urbanecm: please LMK once you are done with the window [08:51:49] will do [08:52:25] 10SRE, 10API Platform, 10Foundational Technology Requests, 10Image-Suggestions, and 4 others: Public-facing API for image suggestions data - https://phabricator.wikimedia.org/T306349 (10kostajh) >>! In T306349#8429727, @VirginiaPoundstone wrote: > @LGoto this has API Platform sign off (via Bill's comment a... [08:53:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [08:54:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [08:54:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [08:54:56] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [08:55:34] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mobileapps: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860520 (owner: 10Giuseppe Lavagetto) [08:56:04] !log urbanecm@deploy1002 Finished scap: 5b2f3ae4db0e2d88c90ef00d48f0673dc9ff83b5: User impact: Make the URL opt-in override the config flag (T323526) (duration: 05m 33s) [08:56:11] T323526: New Impact Module: Start experiment for the new Impact module on Growth Pilot wikis (ar, bn, cs, es) - https://phabricator.wikimedia.org/T323526 [08:56:12] kostajh: should be live. anything else? [08:56:24] (well, will be, once wmf.12 moves forward) [08:56:30] urbanecm: no that's all, thanks [08:56:36] okay [08:56:38] godog: over to you :) [08:56:57] urbanecm: cheers! 
[08:57:01] (03PS3) 10Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) [08:58:03] (03CR) 10CI reject: [V: 04-1] WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) (owner: 10Elukey) [08:58:06] right on time -- well done [08:59:28] 10SRE, 10Traffic-Icebox: HTTP 502 Error when trying to create new page (500k characters) on Romanian Wikisource - https://phabricator.wikimedia.org/T273623 (10Aklapper) 05Stalled→03Declined Unfortunately closing this Phabricator task as no further information has been provided. @Bogdan: If this problem sti... [08:59:48] (03CR) 10Muehlenhoff: [C: 03+2] create group for Release Engineering members [puppet] - 10https://gerrit.wikimedia.org/r/860836 (owner: 10Jaime Nuche) [09:00:31] (03Merged) 10jenkins-bot: mobileapps: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860520 (owner: 10Giuseppe Lavagetto) [09:01:23] heads up -- I'm starting the graphite/statsd failover to graphite1005 [09:01:38] (03CR) 10Filippo Giunchedi: [C: 03+2] wmnet: move writes to graphite1005 [dns] - 10https://gerrit.wikimedia.org/r/861357 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [09:01:42] (03PS2) 10Filippo Giunchedi: wmnet: move writes to graphite1005 [dns] - 10https://gerrit.wikimedia.org/r/861357 (https://phabricator.wikimedia.org/T318903) [09:01:59] (03CR) 10Filippo Giunchedi: [C: 03+2] stats: failover writes to graphite1005 [puppet] - 10https://gerrit.wikimedia.org/r/861359 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [09:02:05] (03PS2) 10Filippo Giunchedi: stats: failover writes to graphite1005 [puppet] - 10https://gerrit.wikimedia.org/r/861359 (https://phabricator.wikimedia.org/T318903) [09:02:23] (03CR) 10Giuseppe Lavagetto: [C: 03+2] wikifeeds: convert to modules [deployment-charts] - 
10https://gerrit.wikimedia.org/r/860513 (owner: 10Giuseppe Lavagetto) [09:04:35] !log flip dns and puppet for statsd/graphite - T318903 [09:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:42] T318903: Put graphite1005 in service - https://phabricator.wikimedia.org/T318903 [09:05:26] 10SRE, 10API Platform, 10Foundational Technology Requests, 10Image-Suggestions, and 4 others: Public-facing API for image suggestions data - https://phabricator.wikimedia.org/T306349 (10kostajh) [09:07:03] (ProbeDown) firing: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:07:14] (03CR) 10Elukey: [C: 03+2] secrets: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860714 (owner: 10Giuseppe Lavagetto) [09:08:45] (03CR) 10Elukey: [C: 03+1] Configure the kube_env file for the spark-operator namespace [puppet] - 10https://gerrit.wikimedia.org/r/854505 (https://phabricator.wikimedia.org/T321686) (owner: 10Btullis) [09:10:13] 10SRE, 10Cloud-Services, 10Developer-Advocacy, 10Infrastructure-Foundations, 10LDAP: Create a single application to provision and manage developer (LDAP) accounts - https://phabricator.wikimedia.org/T179463 (10Aklapper) [09:11:01] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: ICMPv6 'TTL Exceeded' messages are not generated by row E/F switches due to loopback filter - https://phabricator.wikimedia.org/T324033 (10cmooney) 05Open→03Resolved Thanks for the review @ayounsi >>! In T324033#8430782, @ayounsi wrot... 
[09:11:08] (03CR) 10Elukey: [C: 03+2] knative-serving: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860715 (owner: 10Giuseppe Lavagetto) [09:12:03] (ProbeDown) resolved: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:12:41] (03CR) 10Elukey: [C: 03+2] kserve-inference: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860717 (owner: 10Giuseppe Lavagetto) [09:15:29] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: ICMPv6 'TTL Exceeded' messages are not generated by row E/F switches due to loopback filter - https://phabricator.wikimedia.org/T324033 (10taavi) [09:18:42] ok most of dns/puppet bits have propagated for statsd, I'll go ahead with the mw-config deployment [09:19:28] (03CR) 10Filippo Giunchedi: [C: 03+2] ProductionServices: move to graphite1005 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861361 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [09:19:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by filippo@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/861361 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [09:19:56] !log filippo@deploy1002 Started scap: Backport for [[gerrit:861361|ProductionServices: move to graphite1005 (T318903)]] [09:20:04] T318903: Put graphite1005 in service - https://phabricator.wikimedia.org/T318903 [09:20:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [09:20:59] !log filippo@deploy1002 filippo and filippo: Backport for [[gerrit:861361|ProductionServices: move to graphite1005 (T318903)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, 
mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [09:20:59] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [09:21:00] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [09:21:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T323907)', diff saved to https://phabricator.wikimedia.org/P41864 and previous config saved to /var/cache/conftool/dbconfig/20221130-092137-ladsgroup.json [09:21:44] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [09:21:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [09:26:11] !log filippo@deploy1002 Finished scap: Backport for [[gerrit:861361|ProductionServices: move to graphite1005 (T318903)]] (duration: 06m 14s) [09:26:18] T318903: Put graphite1005 in service - https://phabricator.wikimedia.org/T318903 [09:26:36] 10SRE, 10API Platform, 10Foundational Technology Requests, 10Image-Suggestions, and 4 others: Public-facing API for image suggestions data - https://phabricator.wikimedia.org/T306349 (10Joe) Hi everyone, as I understand it, the public API for this service isn't just using the service itself, but rather the... 
[09:26:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [09:27:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [09:27:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [09:28:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [09:28:51] !log bounce navtiming on webperf1003 to pick up statsd changes - T318903 [09:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:06] !log bounce superset on an-tool1010 to pick up statsd changes - T247963 [09:30:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:12] T247963: Migrate role::graphite::production to Bullseye - https://phabricator.wikimedia.org/T247963 [09:32:48] !log bounce superset on an-tool1005 to pick up statsd changes - T247963 [09:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P41865 and previous config saved to /var/cache/conftool/dbconfig/20221130-093644-ladsgroup.json [09:50:55] (03PS2) 10Marostegui: data.yaml: Add Asaf Bartov to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/862175 (https://phabricator.wikimedia.org/T323911) [09:51:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P41866 and previous config saved to /var/cache/conftool/dbconfig/20221130-095150-ladsgroup.json [09:53:53] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/862175 (https://phabricator.wikimedia.org/T323911) (owner: 10Marostegui) [09:54:24] (03CR) 10Marostegui: [C: 03+2] data.yaml: Add Asaf Bartov to analytics-privatedata-users [puppet] - 
10https://gerrit.wikimedia.org/r/862175 (https://phabricator.wikimedia.org/T323911) (owner: 10Marostegui) [09:55:39] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for abartov - https://phabricator.wikimedia.org/T323911 (10Marostegui) 05Open→03Resolved a:05SRamkisson→03Marostegui This is done - please give it 30-60 minutes for the change to spread everywhere. [09:56:15] 10SRE, 10Data-Engineering, 10Kubernetes: eventstreams cannot be deployed and its deployments will need to be destroyed and recreated - https://phabricator.wikimedia.org/T324074 (10BTullis) a:03BTullis [09:58:16] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [09:58:56] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [09:58:56] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 103 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:59:05] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply [09:59:10] (03PS2) 10Giuseppe Lavagetto: wikifeeds: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860513 [09:59:28] (03CR) 10Elukey: "I am very ignorant about the Go code related to controllers, but this change makes a lot of sense to me. I added some comments here and th" [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/861352 (https://phabricator.wikimedia.org/T323706) (owner: 10JMeybohm) [09:59:30] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [09:59:56] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [10:00:04] godog: May I have your attention please! Failover to graphite1005. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221130T1000) [10:00:21] (03CR) 10Elukey: [C: 03+1] helm-state-metrics: Update to v0.2.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/862173 (https://phabricator.wikimedia.org/T323706) (owner: 10JMeybohm) [10:00:52] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:01:06] (03CR) 10Elukey: [C: 03+1] helm-state-metrics: Update to v0.2.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/862174 (https://phabricator.wikimedia.org/T323706) (owner: 10JMeybohm) [10:02:19] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [10:03:38] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] "This is obsolete since I1f9fdb6 and I20128de and can be abandoned." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/740765 (owner: 10Awight) [10:03:40] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [10:04:20] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [10:04:53] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [10:05:43] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [10:06:50] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [10:06:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T323907)', diff saved to https://phabricator.wikimedia.org/P41867 and previous config saved to /var/cache/conftool/dbconfig/20221130-100657-ladsgroup.json [10:06:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance [10:07:09] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [10:07:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance [10:07:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T323907)', diff saved to https://phabricator.wikimedia.org/P41868 and previous config saved to /var/cache/conftool/dbconfig/20221130-100718-ladsgroup.json [10:09:08] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [10:09:16] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [10:10:02] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [10:22:49] (03PS4) 10Elukey: knative-serving: improve chart's dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/861399 (https://phabricator.wikimedia.org/T303279) [10:23:24] (03CR) 10Elukey: "I added an explicit bump of the Chart version to track that something happened :)" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/861399 (https://phabricator.wikimedia.org/T303279) (owner: 10Elukey) [10:24:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T323907)', diff saved to https://phabricator.wikimedia.org/P41869 and previous config saved to /var/cache/conftool/dbconfig/20221130-102423-ladsgroup.json [10:24:31] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [10:25:46] (03CR) 10Marostegui: [C: 03+1] admin: add dasm to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/860132 (https://phabricator.wikimedia.org/T322591) (owner: 10Andrea Denisse) [10:27:09] (03CR) 10David Caro: "This broke cloudinfra idp puppet runs, looking" [puppet] - 10https://gerrit.wikimedia.org/r/861929 (owner: 10Jbond) [10:29:08] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] Rely on the default value for $wgFileExporterTarget (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/762392 (owner: 10Awight) [10:30:18] (03PS1) 10Btullis: Remove chartid from matchlabels for eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/862222 (https://phabricator.wikimedia.org/T324074) [10:30:53] (03CR) 10David Caro: apereo_cas: fix delegated authentication config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861929 (owner: 10Jbond) [10:31:05] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Remove configuration which is the same as the extension's default (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/779014 (owner: 10Awight) [10:35:20] 10SRE, 10Data-Engineering-Planning, 10Data Pipelines (Sprint 05-06), 10Kubernetes, 10Patch-For-Review: eventstreams cannot be deployed and its deployments will need to be destroyed and recreated - https://phabricator.wikimedia.org/T324074 (10BTullis) Adding various Data Engineering planning and streaming... 
[10:36:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:37:12] (03CR) 10Clément Goubert: "I think you need to bump the chart version too." [deployment-charts] - 10https://gerrit.wikimedia.org/r/862222 (https://phabricator.wikimedia.org/T324074) (owner: 10Btullis) [10:38:22] (03PS5) 10David Caro: wmcs.create_instance_with_prefix: Add a sec group default [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841089 [10:38:36] (03CR) 10CI reject: [V: 04-1] wmcs.create_instance_with_prefix: Add a sec group default [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841089 (owner: 10David Caro) [10:39:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P41871 and previous config saved to /var/cache/conftool/dbconfig/20221130-103929-ladsgroup.json [10:40:01] (03PS6) 10David Caro: wmcs.create_instance_with_prefix: Add a sec group default [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841089 [10:41:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:42:56] (03CR) 10CI reject: [V: 04-1] wmcs.create_instance_with_prefix: Add a sec group default [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/841089 (owner: 10David Caro) [10:43:09] 10SRE, 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Kubernetes, 10Patch-For-Review: eventstreams cannot be deployed and its deployments will need to be destroyed and recreated - https://phabricator.wikimedia.org/T324074 (10BTullis) [10:44:24] (03PS2) 10Btullis: Remove 
chartid from matchlabels for eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/862222 (https://phabricator.wikimedia.org/T324074) [10:44:50] (03CR) 10Btullis: Remove chartid from matchlabels for eventstreams (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/862222 (https://phabricator.wikimedia.org/T324074) (owner: 10Btullis) [10:45:08] (03CR) 10Elukey: [C: 03+2] knative-serving: improve chart's dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/861399 (https://phabricator.wikimedia.org/T303279) (owner: 10Elukey) [10:50:39] (03CR) 10Filippo Giunchedi: [C: 03+2] graphite: move alerts to graphite1005 [puppet] - 10https://gerrit.wikimedia.org/r/861358 (https://phabricator.wikimedia.org/T318903) (owner: 10Filippo Giunchedi) [10:53:33] (03CR) 10Clément Goubert: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/862222 (https://phabricator.wikimedia.org/T324074) (owner: 10Btullis) [10:53:49] (03CR) 10Btullis: [C: 03+2] Remove chartid from matchlabels for eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/862222 (https://phabricator.wikimedia.org/T324074) (owner: 10Btullis) [10:54:26] (03PS1) 10Filippo Giunchedi: decom graphite1004 [puppet] - 10https://gerrit.wikimedia.org/r/862226 (https://phabricator.wikimedia.org/T324089) [10:54:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P41872 and previous config saved to /var/cache/conftool/dbconfig/20221130-105436-ladsgroup.json [10:57:16] PROBLEM - SSH on mw1312.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:58:19] (03Merged) 10jenkins-bot: Remove chartid from matchlabels for eventstreams [deployment-charts] - 10https://gerrit.wikimedia.org/r/862222 (https://phabricator.wikimedia.org/T324074) (owner: 10Btullis) [10:58:49] 10SRE, 
10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Kubernetes, 10Patch-For-Review: eventstreams cannot be deployed and its deployments will need to be destroyed and recreated - https://phabricator.wikimedia.org/T324074 (10BTullis) I have received guidance from @Clement_Goubert on the st... [11:00:51] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [11:01:42] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [11:02:10] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [11:03:10] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [11:05:17] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [11:05:19] (03CR) 10Filippo Giunchedi: "To be merged some time next week once we know all is good with graphite1005 in service" [puppet] - 10https://gerrit.wikimedia.org/r/862226 (https://phabricator.wikimedia.org/T324089) (owner: 10Filippo Giunchedi) [11:06:15] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [11:06:48] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply [11:07:35] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [11:07:55] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [11:08:00] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply [11:08:16] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [11:08:29] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (UPDATE clusterissuers) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:08:38] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply 
[11:08:49] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply [11:09:06] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [11:09:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T323907)', diff saved to https://phabricator.wikimedia.org/P41873 and previous config saved to /var/cache/conftool/dbconfig/20221130-110942-ladsgroup.json [11:09:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance [11:09:49] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [11:09:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance [11:10:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T323907)', diff saved to https://phabricator.wikimedia.org/P41874 and previous config saved to /var/cache/conftool/dbconfig/20221130-111003-ladsgroup.json [11:11:28] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [11:11:39] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [11:11:50] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [11:12:14] !log btullis@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=eventstreams,name=codfw [11:12:35] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [11:12:44] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [11:13:03] (03PS3) 10Stevemunene: Add an-presto1006 to presto cluster [puppet] - 10https://gerrit.wikimedia.org/r/861368 (https://phabricator.wikimedia.org/T323783) [11:13:24] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [11:13:28] 
(KubernetesAPILatency) resolved: (2) High Kubernetes API latency (UPDATE clusterissuers) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:14:21] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for vpoundstone - WMF - https://phabricator.wikimedia.org/T314676 (10Marostegui) @VirginiaPoundstone you were not in the WMF LDAP group. I just added you. Can you retry? [11:14:51] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [11:14:53] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [11:15:24] (03CR) 10Elukey: Add an-presto1006 to presto cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861368 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [11:17:09] I'm going to be testing some SAL logging improvements for helmfile on a mw k8s deployment (mw-jobrunner), so disregard the possible noise [11:19:27] 10SRE, 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Kubernetes, 10Patch-For-Review: eventstreams cannot be deployed and its deployments will need to be destroyed and recreated - https://phabricator.wikimedia.org/T324074 (10BTullis) New eventstreams clients in codfw have virtually stopped... [11:19:53] !log upgrade puppetdb1003 to bookworm T321783 [11:19:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:00] T321783: Setup an initial bookworm host pair with Puppetdb 7 - https://phabricator.wikimedia.org/T321783 [11:21:49] (03CR) 10Slyngshede: ldap:management rewrite modify-mfa to use Bitu. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861385 (owner: 10Slyngshede) [11:25:40] (03PS1) 10David Caro: wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) [11:27:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T323907)', diff saved to https://phabricator.wikimedia.org/P41875 and previous config saved to /var/cache/conftool/dbconfig/20221130-112726-ladsgroup.json [11:27:34] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [11:28:23] (03CR) 10CI reject: [V: 04-1] wmcs: add cookbook to create a project [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) (owner: 10David Caro) [11:31:18] 10SRE, 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Kubernetes: eventstreams cannot be deployed and its deployments will need to be destroyed and recreated - https://phabricator.wikimedia.org/T324074 (10BTullis) eventstreams in codfw has been completely drained. {F35825309,width=60%} Proce... [11:32:34] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [11:32:55] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [11:35:11] !log btullis@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=eventstreams,name=codfw [11:40:07] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [11:40:25] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [11:41:16] (03PS2) 10Arturo Borrero Gonzalez: cloudlb: openstack: haproxy: introduce hiera config hash [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [11:42:32] (03CR) 10FNegri: [C: 03+1] "Nice! 
One small thing I noticed is that the inherited help string for "--project" is a bit misleading for this cookbook. Maybe it could be" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) (owner: 10David Caro) [11:42:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P41876 and previous config saved to /var/cache/conftool/dbconfig/20221130-114233-ladsgroup.json [11:43:26] (03CR) 10CI reject: [V: 04-1] cloudlb: openstack: haproxy: introduce hiera config hash [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [11:44:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetdb1003.eqiad.wmnet [11:44:34] (03PS1) 10Hnowlan: admin_ng: set thumbor max memory limit higher [deployment-charts] - 10https://gerrit.wikimedia.org/r/862230 (https://phabricator.wikimedia.org/T233196) [11:46:11] 10SRE, 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Kubernetes: eventstreams cannot be deployed and its deployments will need to be destroyed and recreated - https://phabricator.wikimedia.org/T324074 (10BTullis) eventreams in codfw is now handling traffic again nicely. {F35825322,width=60%... 
[11:46:54] (03CR) 10David Caro: wmcs: add cookbook to create a project (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862229 (https://phabricator.wikimedia.org/T323952) (owner: 10David Caro) [11:47:10] (03CR) 10CI reject: [V: 04-1] admin_ng: set thumbor max memory limit higher [deployment-charts] - 10https://gerrit.wikimedia.org/r/862230 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [11:47:39] (03PS4) 10Stevemunene: Add an-presto1006 to presto cluster [puppet] - 10https://gerrit.wikimedia.org/r/861368 (https://phabricator.wikimedia.org/T323783) [11:48:19] (03CR) 10Jbond: [V: 03+1 C: 03+2] apero_cas: (WIP) add addtional paramas for OIDC [puppet] - 10https://gerrit.wikimedia.org/r/858362 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond) [11:48:33] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [11:49:51] (03CR) 10Stevemunene: Add an-presto1006 to presto cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861368 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [11:49:55] (03PS1) 10Clément Goubert: mw-jobrunner: Better SAL logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/862232 (https://phabricator.wikimedia.org/T303900) [11:50:12] (03CR) 10FNegri: "I have a few questions:" [puppet] - 10https://gerrit.wikimedia.org/r/860627 (https://phabricator.wikimedia.org/T267616) (owner: 10David Caro) [11:50:20] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [11:50:32] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [11:50:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetdb1003.eqiad.wmnet [11:52:47] (03PS2) 10Clément Goubert: mw-jobrunner: Better SAL logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/862232 (https://phabricator.wikimedia.org/T303900) [11:53:06] !log btullis@cumin1001 START - Cookbook 
sre.discovery.service-route [11:53:09] !log btullis@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [11:53:22] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [11:53:25] 10SRE, 10Infrastructure-Foundations: Puppet should support VERSION_CODENAME to detect a distro - https://phabricator.wikimedia.org/T321906 (10MoritzMuehlenhoff) 05Open→03Declined John's patches linked to the task work great and updating to a "new" testing is now practically no effort, so we've decided to n... [11:56:34] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [11:56:51] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [11:57:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P41877 and previous config saved to /var/cache/conftool/dbconfig/20221130-115739-ladsgroup.json [11:57:43] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [11:58:29] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [11:58:45] !log btullis@cumin1001 START - Cookbook sre.discovery.service-route [11:59:39] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [12:00:25] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [12:03:49] !log btullis@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [12:04:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:05:30] 10SRE, 10Data-Engineering-Planning, 10Event-Platform Value Stream, 
10Kubernetes: eventstreams cannot be deployed and its deployments will need to be destroyed and recreated - https://phabricator.wikimedia.org/T324074 (10BTullis) Depooled eqiad using the cookbook method, which was new to me. ` btullis@cumin... [12:07:08] (03PS1) 10Muehlenhoff: Extend access for dani [puppet] - 10https://gerrit.wikimedia.org/r/862235 [12:08:44] (03PS2) 10Hnowlan: admin_ng: set thumbor max memory limit higher [deployment-charts] - 10https://gerrit.wikimedia.org/r/862230 (https://phabricator.wikimedia.org/T233196) [12:09:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:10:44] RECOVERY - swift eqiad object availability low on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad [12:12:28] (03PS3) 10FNegri: WIP: idea for cloud cumin::target [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [12:12:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T323907)', diff saved to https://phabricator.wikimedia.org/P41879 and previous config saved to /var/cache/conftool/dbconfig/20221130-121246-ladsgroup.json [12:12:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:12:54] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [12:13:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [12:13:12] (03CR) 10Muehlenhoff: [C: 03+2] Extend access for dani [puppet] - 10https://gerrit.wikimedia.org/r/862235 
(owner: 10Muehlenhoff) [12:14:00] (03CR) 10CI reject: [V: 04-1] WIP: idea for cloud cumin::target [puppet] - 10https://gerrit.wikimedia.org/r/861855 (https://phabricator.wikimedia.org/T323483) (owner: 10Jbond) [12:15:29] (03PS1) 10Jbond: P:idp: move flask test script to a file [puppet] - 10https://gerrit.wikimedia.org/r/862237 [12:17:36] (03CR) 10CI reject: [V: 04-1] P:idp: move flask test script to a file [puppet] - 10https://gerrit.wikimedia.org/r/862237 (owner: 10Jbond) [12:18:39] (03CR) 10Muehlenhoff: [C: 03+1] "These looks good to me. Note that the systemd unit shipped by upstream already has come - by default commented out - hardening statements," [puppet] - 10https://gerrit.wikimedia.org/r/861445 (https://phabricator.wikimedia.org/T323944) (owner: 10Ssingh) [12:19:11] (03PS2) 10Jbond: P:idp: move flask test script to a file [puppet] - 10https://gerrit.wikimedia.org/r/862237 [12:24:14] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [12:24:19] (03PS1) 10Stevemunene: Add dummy keytabs for new presto1006-1015 servers [labs/private] - 10https://gerrit.wikimedia.org/r/862240 (https://phabricator.wikimedia.org/T323783) [12:24:30] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [12:25:36] (03CR) 10Jbond: [C: 03+2] P:idp: move flask test script to a file [puppet] - 10https://gerrit.wikimedia.org/r/862237 (owner: 10Jbond) [12:26:54] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [12:27:20] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [12:29:22] !log btullis@cumin1001 START - Cookbook sre.discovery.service-route [12:30:02] (03PS2) 10KartikMistry: Update cxserver to 2022-11-28-053412-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/861195 (https://phabricator.wikimedia.org/T323825) [12:30:08] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/862241 [12:32:59] (03CR) 
10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/862241 (owner: 10Muehlenhoff) [12:34:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) [12:35:54] PROBLEM - SSH on mw1330.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:35:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:36:09] * kart_ is deploying cxserver in a few minutes. Nothing major. [12:36:58] 10SRE, 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Kubernetes: eventstreams cannot be deployed and its deployments will need to be destroyed and recreated - https://phabricator.wikimedia.org/T324074 (10BTullis) All looks good with eventstreams in eqiad again. {F35825355,width=60%} [12:37:07] (03CR) 10KartikMistry: [C: 03+2] Update cxserver to 2022-11-28-053412-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/861195 (https://phabricator.wikimedia.org/T323825) (owner: 10KartikMistry) [12:40:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:41:37] (03CR) 10Btullis: [C: 03+1] "Looks good to me." 
[labs/private] - 10https://gerrit.wikimedia.org/r/862240 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [12:42:07] (03Merged) 10jenkins-bot: Update cxserver to 2022-11-28-053412-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/861195 (https://phabricator.wikimedia.org/T323825) (owner: 10KartikMistry) [12:44:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [12:45:07] 10SRE, 10SRE-Access-Requests: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10matthiasmullie) [12:45:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [12:46:00] 10SRE, 10SRE-Access-Requests: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10Marostegui) [12:46:10] (03CR) 10Stevemunene: [C: 03+2] Add dummy keytabs for new presto1006-1015 servers [labs/private] - 10https://gerrit.wikimedia.org/r/862240 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [12:46:27] (03PS1) 10Matthias Mullie: Add mlitn to analytics-platform-eng-admins [puppet] - 10https://gerrit.wikimedia.org/r/862245 (https://phabricator.wikimedia.org/T324101) [12:46:43] (03CR) 10Stevemunene: [V: 03+2 C: 03+2] Add dummy keytabs for new presto1006-1015 servers [labs/private] - 10https://gerrit.wikimedia.org/r/862240 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [12:46:53] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10matthiasmullie) [12:47:54] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:48:06] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - 
https://phabricator.wikimedia.org/T324101 (10Marostegui) Confirmed L3 is signed. @MarkTraceur we need your approval for this. [12:48:18] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:48:32] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:48:51] 10SRE, 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Kubernetes: eventstreams cannot be deployed and its deployments will need to be destroyed and recreated - https://phabricator.wikimedia.org/T324074 (10Ottomata) Thank you both! [12:50:20] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 1 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:51:07] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38532/console" [puppet] - 10https://gerrit.wikimedia.org/r/861368 (https://phabricator.wikimedia.org/T323783) (owner: 10Stevemunene) [12:52:10] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [12:52:21] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [12:53:10] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [12:54:01] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [12:54:38] PROBLEM - SSH on mw1326.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:54:45] !log kartik@deploy1002 helmfile [codfw] 
START helmfile.d/services/cxserver: apply [12:55:36] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [12:55:37] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:55:49] (03CR) 10Btullis: [C: 03+2] Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [12:56:05] (03CR) 10Btullis: [C: 03+2] Modify jupyterhub config to point to conda-analytics instead of anaconda-wmf. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/843959 (https://phabricator.wikimedia.org/T321088) (owner: 10Xcollazo) [12:56:12] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [12:57:21] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:58:01] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:58:15] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:58:42] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:59:07] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [12:59:15] !log Updated cxserver to 2022-11-28-053412-production (T323825, T319177) [12:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:23] T319177: Enable Section Translation on 6 Wikipedias where Content Translation is available by default - https://phabricator.wikimedia.org/T319177 [12:59:23] T323825: Enable Content and Section translation on 10 Wikipedias - https://phabricator.wikimedia.org/T323825 [12:59:57] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [13:01:09] (03PS3) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 
(https://phabricator.wikimedia.org/T297596) [13:01:21] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [13:02:09] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [13:03:17] (03CR) 10CI reject: [V: 04-1] cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) (owner: 10Arturo Borrero Gonzalez) [13:04:11] (03PS1) 10Guergana Tzatchkova: Add Property (120) to Wikidata content Namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) [13:04:44] PROBLEM - Disk space on stat1004 is CRITICAL: DISK CRITICAL - free space: / 144 MB (0% inode=79%): /tmp 144 MB (0% inode=79%): /var/tmp 144 MB (0% inode=79%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1004&var-datasource=eqiad+prometheus/ops [13:04:47] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:05:51] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [13:06:43] btullis: FYI see stat1004 out of space ^ [13:07:05] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [13:07:20] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38534/console" [puppet] - 10https://gerrit.wikimedia.org/r/861412 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [13:14:12] (03PS1) 10Muehlenhoff: Add Hiera settings for second bookworm puppetdb pair [puppet] - 10https://gerrit.wikimedia.org/r/862256 (https://phabricator.wikimedia.org/T321783) [13:15:13] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10Marostegui) p:05Triage→03Medium [13:15:57] 
(03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] thanos: add thanos-web to catalog and frontend [puppet] - 10https://gerrit.wikimedia.org/r/861412 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [13:16:03] (03PS3) 10Filippo Giunchedi: thanos: add thanos-web to catalog and frontend [puppet] - 10https://gerrit.wikimedia.org/r/861412 (https://phabricator.wikimedia.org/T323913) [13:16:14] RECOVERY - SSH on mw1312.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:18:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2104.codfw.wmnet with reason: Maintenance [13:18:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2104.codfw.wmnet with reason: Maintenance [13:18:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T323907)', diff saved to https://phabricator.wikimedia.org/P41880 and previous config saved to /var/cache/conftool/dbconfig/20221130-131854-ladsgroup.json [13:19:02] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [13:20:27] I'd like to quickly reboot grafana1002 to get the latest kernel, any objections [13:20:37] cc volans _joe_ as oncalls ^ [13:20:50] <_joe_> +1 [13:21:02] ack thanks [13:21:03] ack, no objections [13:21:16] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host grafana1002.eqiad.wmnet [13:21:53] not going to say anything on how long it'll take or I'll jinx(er) it [13:21:56] (03CR) 10David Caro: harbor: remove unused harbor::db module/role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/860627 (https://phabricator.wikimedia.org/T267616) (owner: 10David Caro) [13:23:05] (we're back) [13:23:17] (03CR) 10Slyngshede: [C: 03+1] Allow multiple server connections to be defined. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/860857 (owner: 10Slyngshede) [13:23:23] (03CR) 10Slyngshede: [V: 03+2] Allow multiple server connections to be defined. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/860857 (owner: 10Slyngshede) [13:23:28] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Allow multiple server connections to be defined. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/860857 (owner: 10Slyngshede) [13:25:42] (03PS1) 10Filippo Giunchedi: hiera: move thanos-web to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/862258 (https://phabricator.wikimedia.org/T323913) [13:27:55] (03PS1) 10Muehlenhoff: postgresql: Add bookworm support [puppet] - 10https://gerrit.wikimedia.org/r/862260 (https://phabricator.wikimedia.org/T321783) [13:29:56] (03PS4) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [13:30:04] Daimona, HouseOfM, and cmelo: (Dis)respected human, time to deploy Create schema for the CampaignEvents extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221130T1330). Please do the needful. 
[13:30:35] o/ [13:31:31] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38535/console" [puppet] - 10https://gerrit.wikimedia.org/r/862258 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [13:31:32] o/ [13:31:41] !log filippo@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host grafana1002.eqiad.wmnet [13:32:14] (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] hiera: move thanos-web to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/862258 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [13:33:10] cheers vgutierrez [13:33:35] Alright, so I'm gonna check that the command is correct [13:33:35] godog: lvs1020 && lvs2010 are the secondary LVS [13:33:54] lvs1019 and lvs2009 the involved primaries [13:34:14] vgutierrez: *ack* thanks, I just noticed a deployment window open, I'll wait for that to end and then merge/deploy [13:34:23] (03PS2) 10Klausman: API GW: add config for addtional LW inference services [deployment-charts] - 10https://gerrit.wikimedia.org/r/861401 (https://phabricator.wikimedia.org/T323916) [13:34:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T323907)', diff saved to https://phabricator.wikimedia.org/P41881 and previous config saved to /var/cache/conftool/dbconfig/20221130-133436-ladsgroup.json [13:34:45] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [13:35:06] (ConfdResourceFailed) firing: (2) confd resource _srv_config-master_pybal_codfw_thanos-web.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:35:11] (03PS3) 10Klausman: API GW: add config for addtional LW inference services [deployment-charts] - 10https://gerrit.wikimedia.org/r/861401 (https://phabricator.wikimedia.org/T323916) [13:35:22] 
(03PS1) 10Jbond: idp: rename property [puppet] - 10https://gerrit.wikimedia.org/r/862261 [13:35:24] (03PS1) 10Jbond: P:idp::standalone: create virtual env for idp-test-login app [puppet] - 10https://gerrit.wikimedia.org/r/862262 [13:35:32] Could someone please point me to the list of DB hosts? I'd like to make sure I'm connecting to the right one. [13:35:43] (03CR) 10Jbond: [C: 03+2] idp: rename property [puppet] - 10https://gerrit.wikimedia.org/r/862261 (owner: 10Jbond) [13:36:12] (03CR) 10Klausman: "I've integrated the changes, and upon suggestion by Luca made the regexen slightly more strict. I don't forsee that to be a particular ave" [deployment-charts] - 10https://gerrit.wikimedia.org/r/861401 (https://phabricator.wikimedia.org/T323916) (owner: 10Klausman) [13:37:21] (03CR) 10Muehlenhoff: "Script looks good, two remaining comments on the Puppet integration." [puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede) [13:38:15] (03CR) 10CI reject: [V: 04-1] P:idp::standalone: create virtual env for idp-test-login app [puppet] - 10https://gerrit.wikimedia.org/r/862262 (owner: 10Jbond) [13:38:44] (03CR) 10Vgutierrez: [C: 03+1] "small nitpick inline, looks good" [puppet] - 10https://gerrit.wikimedia.org/r/861850 (https://phabricator.wikimedia.org/T314118) (owner: 10Filippo Giunchedi) [13:39:10] !log filippo@cumin1001 conftool action : set/pooled=yes:weight=100; selector: service=thanos-web [13:39:31] HouseOfM, cmelo: does any of you happen to have the link from last time? 
[13:39:45] I'm search but can't see it [13:39:50] (03CR) 10Muehlenhoff: Add mlitn to analytics-platform-eng-admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/862245 (https://phabricator.wikimedia.org/T324101) (owner: 10Matthias Mullie) [13:40:08] (03PS2) 10Jbond: P:idp::standalone: create virtual env for idp-test-login app [puppet] - 10https://gerrit.wikimedia.org/r/862262 [13:40:15] Yeah, neither do I [13:41:11] I don't [13:41:39] <_joe_> Daimona: https://noc.wikimedia.org/db.php [13:41:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:42:02] Thanks :) [13:42:17] (03CR) 10CI reject: [V: 04-1] P:idp::standalone: create virtual env for idp-test-login app [puppet] - 10https://gerrit.wikimedia.org/r/862262 (owner: 10Jbond) [13:42:28] I see x1 is not there, but I did find https://gerrit.wikimedia.org/g/operations/puppet/+/26d0405410e85043e6296d12778e50c2c48e358c/hieradata/hosts/db1120.yaml, and since it says I'm connected to db1120, we should be good [13:44:06] Also, I can see the tables listed in https://wikitech.wikimedia.org/wiki/MariaDB#Extension_storage, so looking good. 
[13:44:06] (03Abandoned) 10Jgiannelos: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/857046 (owner: 10PipelineBot) [13:44:23] (03Abandoned) 10Jgiannelos: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/857047 (owner: 10PipelineBot) [13:44:51] I'm going ahead then [13:45:05] (ConfdResourceFailed) resolved: (2) confd resource _srv_config-master_pybal_codfw_thanos-web.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:45:43] !log Creating schema for the CampaignEvents extension in x1 wikishared # T322745 [13:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:50] T322745: Enable CampaignEvents extension on Meta-wiki - https://phabricator.wikimedia.org/T322745 [13:46:20] (03PS4) 10Klausman: API GW: add config for addtional LW inference services [deployment-charts] - 10https://gerrit.wikimedia.org/r/861401 (https://phabricator.wikimedia.org/T323916) [13:46:27] PROBLEM - Check systemd state on planet1002 is CRITICAL: CRITICAL - degraded: The following units failed: user@113.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:46:37] * Lucas_WMDE will be afk for the beginning of the backport window btw [13:47:08] Aaaaaand I can see the new tables on a replica with SHOW TABLES. So I guess we're done. 
[13:47:27] (03PS5) 10Arturo Borrero Gonzalez: cloudlb: introduce role skeleton [puppet] - 10https://gerrit.wikimedia.org/r/861902 (https://phabricator.wikimedia.org/T297596) [13:48:07] Great, thanks Daimona [13:48:22] (03CR) 10Filippo Giunchedi: varnish: teach confd-reload-vcl to write a Prometheus state file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861850 (https://phabricator.wikimedia.org/T314118) (owner: 10Filippo Giunchedi) [13:48:41] jouncebot: now and next [13:48:41] For the next 0 hour(s) and 41 minute(s): Create schema for the CampaignEvents extension (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221130T1330) [13:48:46] Yup, everything's looking good. I think we're officially done now. [13:48:53] Daimona: ack, thanks! [13:49:11] jouncebot: next [13:49:11] In 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221130T1400) [13:49:22] Thanks @Daimona! [13:49:39] Also, wait, this window was only supposed to be 30 mins long, sorry :) [13:49:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P41882 and previous config saved to /var/cache/conftool/dbconfig/20221130-134943-ladsgroup.json [13:49:46] uughhh ok I'll wait out the afternoon window [13:49:57] (to deploy my pybal-affecting change) [13:50:28] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/861850 (https://phabricator.wikimedia.org/T314118) (owner: 10Filippo Giunchedi) [13:50:35] (03PS3) 10Filippo Giunchedi: varnish: teach confd-reload-vcl to write a Prometheus state file [puppet] - 10https://gerrit.wikimedia.org/r/861850 (https://phabricator.wikimedia.org/T314118) [13:51:08] (03PS3) 10Jbond: P:idp::standalone: create virtual env for idp-test-login app [puppet] - 10https://gerrit.wikimedia.org/r/862262 [13:53:21] PROBLEM - SSH on db1122.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:53:33] (03CR) 10jenkins-bot: P:idp::standalone: create virtual env for idp-test-login app [puppet] - 10https://gerrit.wikimedia.org/r/862262 (owner: 10Jbond) [13:55:24] (03PS1) 10Urbanecm: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862264 (https://phabricator.wikimedia.org/T324105) [13:56:11] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10SRE Observability (FY2022/2023-Q2), 10User-fgiunchedi: Reduce IRC flood/spam during incidents - https://phabricator.wikimedia.org/T314118 (10fgiunchedi) [13:57:53] PROBLEM - Confd vcl based reload on cp2042 is CRITICAL: reload-vcl failed to run since 179h, 11 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:58:06] (03PS4) 10Jbond: P:idp::standalone: create virtual env for idp-test-login app [puppet] - 10https://gerrit.wikimedia.org/r/862262 [13:58:33] PROBLEM - Confd vcl based reload on cp3051 is CRITICAL: reload-vcl failed to run since 1150h, 11 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:58:37] PROBLEM - Confd vcl based reload on cp2030 is CRITICAL: reload-vcl failed to run since 179h, 12 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:58:45] PROBLEM - Confd vcl based reload on cp6012 is CRITICAL: reload-vcl failed to run since 479h, 8 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [13:58:48] Daimona: hey, ad https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/859634/, I'm not sure...is there a specific reason why a list of users in the config file was used? [13:59:04] hmmm [13:59:12] godog: ^^ related to your change?= [13:59:33] vgutierrez: uughhh yeah quite possible, thank you I'll take a look [13:59:45] (03PS5) 10Jbond: P:idp::standalone: create virtual env for idp-test-login app [puppet] - 10https://gerrit.wikimedia.org/r/862262 [13:59:49] (03CR) 10Urbanecm: Configure the CampaignEvents ext to use the x1.wikishared db for meta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859635 (https://phabricator.wikimedia.org/T322745) (owner: 10Daimona Eaytoy) [13:59:52] PROBLEM - Confd vcl based reload on cp6004 is CRITICAL: reload-vcl failed to run since 672h, 8 minutes. https://wikitech.wikimedia.org/wiki/Varnish [13:59:58] PROBLEM - Confd vcl based reload on cp3054 is CRITICAL: reload-vcl failed to run since 1150h, 14 minutes. https://wikitech.wikimedia.org/wiki/Varnish [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221130T1400). [14:00:05] Urbanecm, Daimona, HouseOfM, and cmelo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:06] godog: feel free to disable puppet on A:cp [14:00:06] @urbanecm are you asking about the specific implementation? [14:00:15] vgutierrez: good idea, will do [14:00:33] HouseOfM: I'm asking about the usage of the hook at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/859634/. it feels like an use case for user groups to me. 
[14:00:41] it doesn't feel right to have this in server configuration [14:00:51] since jouncebot called: I can deploy today :) [14:01:10] godog: actually, is it ok to proceed while you're investigating the alerts? [14:01:12] PROBLEM - Confd vcl based reload on cp6015 is CRITICAL: reload-vcl failed to run since 479h, 11 minutes. https://wikitech.wikimedia.org/wiki/Varnish [14:01:14] It's temporary, user groups are coming [14:01:18] PROBLEM - Confd vcl based reload on cp1090 is CRITICAL: reload-vcl failed to run since 1150h, 22 minutes. https://wikitech.wikimedia.org/wiki/Varnish [14:01:21] urbanecm: yes, go ahead [14:01:24] okay, thanks [14:01:26] it's as monitoring artifact [14:01:30] *a [14:01:39] gotcha [14:01:44] PROBLEM - Confd vcl based reload on cp6007 is CRITICAL: reload-vcl failed to run since 672h, 10 minutes. https://wikitech.wikimedia.org/wiki/Varnish [14:01:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:02:36] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862264 (https://phabricator.wikimedia.org/T324105) (owner: 10Urbanecm) [14:03:09] urbanecm: Hey, nice to see you here :) As HouseOfM said, user groups might be the final implementation. The reason why we didn't do it just yet is that we're still in the phase of determining what the criteria will be, especially because the extension uses global accounts, so we may need a global user group, etc. 
[14:03:28] (03Merged) 10jenkins-bot: Add new throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862264 (https://phabricator.wikimedia.org/T324105) (owner: 10Urbanecm) [14:03:28] the irony of the fact that I'm fixing the irc alert spam while causing more spam is not lost on me [14:03:44] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:862264|Add new throttle rule (T324105)]] [14:03:44] PROBLEM - Confd vcl based reload on cp6014 is CRITICAL: reload-vcl failed to run since 479h, 13 minutes. https://wikitech.wikimedia.org/wiki/Varnish [14:03:51] T324105: Request a throttle lift for an editaton - 2022-11-30 - https://phabricator.wikimedia.org/T324105 [14:03:59] So for now we'll be using a list of beta testers (to be expanded in another config change in a couple weeks), until we'll have the final criteria [14:04:12] Daimona: would it make sense to create an user group now, and _not_ allow anyone to change the group members? that'd use a nicer syntax, while allowing you to figure out who should be able to manage the group [14:04:38] Trust and Safety and Stewards can change members of any user group (regardless of wgAddGroups/wgRemoveGroups) [14:04:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P41883 and previous config saved to /var/cache/conftool/dbconfig/20221130-140450-ladsgroup.json [14:04:57] Maybe, but then I think log entries etc. would stay around forever? [14:05:07] well, yes, but that's a good thing, isn't it? [14:05:27] Even for a temporary solution? 
[14:05:50] (03PS2) 10Filippo Giunchedi: WIP mediawiki: remove PHP7 icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/841887 (https://phabricator.wikimedia.org/T314118) [14:05:52] (03PS1) 10Filippo Giunchedi: varnish: check vcl reload for old and new state [puppet] - 10https://gerrit.wikimedia.org/r/862266 (https://phabricator.wikimedia.org/T314118) [14:06:02] And especially if we'll end up //not// using a user group for the final implementation (which we still don't know) [14:06:24] Daimona: I think so. it allows to easily see who the beta tester are and were, which can come helpful later too, when presenting the project. [14:07:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:07:26] (03CR) 10Daimona Eaytoy: Configure the CampaignEvents ext to use the x1.wikishared db for meta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859635 (https://phabricator.wikimedia.org/T322745) (owner: 10Daimona Eaytoy) [14:07:48] (03CR) 10Urbanecm: Configure the CampaignEvents ext to use the x1.wikishared db for meta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859635 (https://phabricator.wikimedia.org/T322745) (owner: 10Daimona Eaytoy) [14:08:28] We do have this information available, let me find it. 
I agree it's not as nice as a user group, but it also has no BC cost for the future [14:09:05] Here you go, I should have probably linked this in the patch: https://meta.wikimedia.org/wiki/Campaigns/Foundation_Product_Team/Registration/V1_Summary#Organizer_testers [14:09:17] And the list will be expanded with some new people once we have the names [14:09:21] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:862264|Add new throttle rule (T324105)]] (duration: 05m 37s) [14:09:28] T324105: Request a throttle lift for an editaton - 2022-11-30 - https://phabricator.wikimedia.org/T324105 [14:09:45] (03PS2) 10Filippo Giunchedi: varnish: check vcl reload for old and new state [puppet] - 10https://gerrit.wikimedia.org/r/862266 (https://phabricator.wikimedia.org/T314118) [14:10:02] PROBLEM - Confd vcl based reload on cp6016 is CRITICAL: reload-vcl failed to run since 479h, 19 minutes. https://wikitech.wikimedia.org/wiki/Varnish [14:10:14] RECOVERY - Check systemd state on planet1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:10:17] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:10:52] (03CR) 10Jbond: [C: 03+2] P:idp::standalone: create virtual env for idp-test-login app [puppet] - 10https://gerrit.wikimedia.org/r/862262 (owner: 10Jbond) [14:11:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:11:51] Daimona: i still feel that granting user rights by config changes is very confusing (and user groups are easier to understand). not saying it can't get synced out as-is, but before doing it, it should probably have +1 from a different deployer too. 
[14:12:49] (03CR) 10Vgutierrez: [C: 03+1] varnish: check vcl reload for old and new state [puppet] - 10https://gerrit.wikimedia.org/r/862266 (https://phabricator.wikimedia.org/T314118) (owner: 10Filippo Giunchedi) [14:12:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:13:00] (03CR) 10Filippo Giunchedi: [C: 03+2] varnish: check vcl reload for old and new state [puppet] - 10https://gerrit.wikimedia.org/r/862266 (https://phabricator.wikimedia.org/T314118) (owner: 10Filippo Giunchedi) [14:13:05] (03PS3) 10Filippo Giunchedi: varnish: check vcl reload for old and new state [puppet] - 10https://gerrit.wikimedia.org/r/862266 (https://phabricator.wikimedia.org/T314118) [14:13:11] I think we should not use a user group for now, because we don't know yet what will be the criterias for someone to become an organizer, and if we will really use a user group or not. [14:14:02] urbanecm: Yup, I certainly agree with that. But I also think a user group would add some cost and potential confusion, especially if it won't be chosen as the final implementation [14:15:36] unfortunately, I'm not comfortable deploying the patch as-is (ie. with a hook and no +1 from a deployer). i can ping other deployers, see what they think and if they're ok, it can be synced in a later window. or, we can change it to an user-group solution. [14:15:40] not sure what you prefer Daimona & others [14:17:05] (03CR) 10FNegri: harbor: remove unused harbor::db module/role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/860627 (https://phabricator.wikimedia.org/T267616) (owner: 10David Caro) [14:17:14] RECOVERY - Confd vcl based reload on cp3051 is OK: reload-vcl successfully ran 1150h, 29 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [14:17:52] * Lucas_WMDE here now fwiw [14:17:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:18:05] Lucas_WMDE: if you have any opinion on the matter, I'm curious to hear what you think [14:18:09] Hearing more thoughts wouldn't hurt. I'm still not convinced that a user group would be the best solution, but I do see the shortcomings of the current implementation and am open to more opinions [14:18:18] (03CR) 10Clément Goubert: [C: 03+2] mw-jobrunner: Better SAL logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/862232 (https://phabricator.wikimedia.org/T303900) (owner: 10Clément Goubert) [14:18:24] * Lucas_WMDE scrolls up [14:18:41] Lucas_WMDE: TLDR: the question is is it a good idea to use UserGetRights hook to grant rights for "testing" (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/859634/) [14:19:07] I prefer an user group, even though it might not be used in the final product, because it is a more understandable way of granting permissions, with less risks of getting confused. [14:19:25] !log reenable puppet on A:cp [14:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:56] PROBLEM - Confd vcl based reload on cp1089 is CRITICAL: reload-vcl failed to run since 1150h, 42 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [14:19:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T323907)', diff saved to https://phabricator.wikimedia.org/P41884 and previous config saved to /var/cache/conftool/dbconfig/20221130-141956-ladsgroup.json [14:19:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2125.codfw.wmnet with reason: Maintenance [14:20:04] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [14:20:12] I think I would also lean towards a user group [14:20:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2125.codfw.wmnet with reason: Maintenance [14:20:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T323907)', diff saved to https://phabricator.wikimedia.org/P41885 and previous config saved to /var/cache/conftool/dbconfig/20221130-142018-ladsgroup.json [14:20:27] but let me test locally how MediaWiki behaves when a user group is un-defined [14:20:56] whether it’s handled gracefully or not [14:21:12] Lucas_WMDE: afaik it is handled very gracefully, but can't hurt to (re)test. [14:21:36] PROBLEM - Confd vcl based reload on cp2035 is CRITICAL: reload-vcl failed to run since 179h, 33 minutes. https://wikitech.wikimedia.org/wiki/Varnish [14:22:43] I would also like to test, but made the bad decision to upgrade docker a few mins ago, so have to spin up the env etc [14:22:59] (03Merged) 10jenkins-bot: mw-jobrunner: Better SAL logging [deployment-charts] - 10https://gerrit.wikimedia.org/r/862232 (https://phabricator.wikimedia.org/T303900) (owner: 10Clément Goubert) [14:23:10] PROBLEM - Confd vcl based reload on cp3056 is CRITICAL: reload-vcl failed to run since 1150h, 38 minutes. https://wikitech.wikimedia.org/wiki/Varnish [14:23:14] PROBLEM - Confd vcl based reload on cp5016 is CRITICAL: reload-vcl failed to run since 39h, 57 minutes. 
https://wikitech.wikimedia.org/wiki/Varnish [14:23:33] AFAICT it might not be possible to remove a no-longer-defined user group from a user who is still in the group [14:23:40] (03CR) 10Elukey: [C: 03+2] Add basic rate-limit capabilities to ML clusters (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/860925 (https://phabricator.wikimedia.org/T300259) (owner: 10Elukey) [14:23:51] there’s no option for the group on Special:UserRights nor in the userrights APIA [14:23:52] *API [14:23:56] but that seems acceptable to me [14:23:59] Lucas_WMDE: emptyUserGroup.php from server-side (I routinely run it after deleting a group) [14:24:03] ah ok [14:24:06] RECOVERY - Confd vcl based reload on cp6012 is OK: reload-vcl successfully ran 479h, 34 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [14:24:12] RECOVERY - Confd vcl based reload on cp2042 is OK: reload-vcl successfully ran 179h, 38 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [14:24:30] the recoveries are hamrless ^ just spammy [14:24:46] RECOVERY - Confd vcl based reload on cp6007 is OK: reload-vcl successfully ran 672h, 33 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [14:24:52] I was gonna say, if we forget to empty the group before, just temporarily define it again [14:24:52] otherwise MediaWiki seems to handle it gracefully [14:24:58] RECOVERY - Confd vcl based reload on cp3056 is OK: reload-vcl successfully ran 1150h, 39 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [14:25:02] RECOVERY - Confd vcl based reload on cp5016 is OK: reload-vcl successfully ran 39h, 59 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [14:25:16] Indeed, that seems to be the case [14:25:32] RECOVERY - Confd vcl based reload on cp2030 is OK: reload-vcl successfully ran 179h, 39 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [14:25:33] we could put something like “beta” in the group name itself [14:25:48] so it’s clear that it’s not a permanent thing when someone looks at the log entries later [14:26:00] (and then add a separate non-beta group if that’s the decided-on solution later) [14:26:06] yeah, or "testers" [14:26:10] If the user group option is an easy change, and if we can revert it in case we do not want to use it in the future, it is fine for me. [14:26:19] cmelo: i think it is [14:26:21] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [14:26:22] I agree [14:26:23] Yeah, that's what I was thinking [14:26:36] so, let's change towards an user group? [14:26:48] PROBLEM - Confd vcl based reload on cp3055 is CRITICAL: reload-vcl failed to run since 1150h, 39 minutes. https://wikitech.wikimedia.org/wiki/Varnish [14:26:55] A meta-only temporary group of beta testers would work, I think, yes [14:26:56] RECOVERY - Confd vcl based reload on cp6004 is OK: reload-vcl successfully ran 672h, 35 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [14:27:10] RECOVERY - Confd vcl based reload on cp3054 is OK: reload-vcl successfully ran 1150h, 42 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [14:27:12] great. as soon as patches are updated, happy to deploy that. [14:27:35] For the sake of speed... What do I need to change? [14:27:36] RECOVERY - Confd vcl based reload on cp6016 is OK: reload-vcl successfully ran 479h, 37 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [14:27:48] RECOVERY - Confd vcl based reload on cp6015 is OK: reload-vcl successfully ran 479h, 37 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [14:27:52] RECOVERY - Confd vcl based reload on cp1090 is OK: reload-vcl successfully ran 1150h, 48 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [14:27:52] (03PS2) 10Matthias Mullie: Add mlitn to analytics-platform-eng-admins [puppet] - 10https://gerrit.wikimedia.org/r/862245 (https://phabricator.wikimedia.org/T324101) [14:27:56] (03CR) 10Matthias Mullie: Add mlitn to analytics-platform-eng-admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/862245 (https://phabricator.wikimedia.org/T324101) (owner: 10Matthias Mullie) [14:28:09] Daimona: you need to add a new group to IS.php, under groupOverrides (metawiki) [14:28:23] (03PS1) 10Jbond: idp::standalone: add chdir for uwsgi app [puppet] - 10https://gerrit.wikimedia.org/r/862271 [14:28:23] and basically abandon https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/859634/ (or replace it with the new IS patch [14:28:36] RECOVERY - Confd vcl based reload on cp6014 is OK: reload-vcl successfully ran 479h, 38 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [14:28:47] And what about i18n? [14:28:58] let's do it on-wiki [14:28:58] RECOVERY - Confd vcl based reload on cp1089 is OK: reload-vcl successfully ran 1150h, 51 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [14:29:00] RECOVERY - Confd vcl based reload on cp2035 is OK: reload-vcl successfully ran 179h, 41 minutes ago. https://wikitech.wikimedia.org/wiki/Varnish [14:30:05] needs `MediaWiki:Group-GROUPNAME` and `MediaWiki:Group-GROUPNAME-member`. [14:30:48] RECOVERY - Confd vcl based reload on cp3055 is OK: reload-vcl successfully ran 1150h, 43 minutes ago. 
https://wikitech.wikimedia.org/wiki/Varnish [14:30:54] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [14:31:34] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [14:31:57] !log cgoubert@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [14:31:57] !log cgoubert@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [14:32:13] yeah, I don’t think this one needs to go into WikimediaMessages, on-wiki should be fine [14:32:15] !log cgoubert@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [14:32:17] !log cgoubert@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [14:32:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] zotero: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860710 (owner: 10Giuseppe Lavagetto) [14:32:30] (03CR) 10Andrew Bogott: [C: 03+2] Openstack config: move keystone_authtoken into a shared template [puppet] - 10https://gerrit.wikimedia.org/r/861940 (owner: 10Andrew Bogott) [14:32:35] Lucas_WMDE: indeed. the final one would need to be in WikimediaMessages, but for a single-wiki test, onwiki's better. [14:33:00] Right, because it's meta-only, perfect [14:33:05] (03CR) 10Jbond: [C: 03+2] idp::standalone: add chdir for uwsgi app [puppet] - 10https://gerrit.wikimedia.org/r/862271 (owner: 10Jbond) [14:33:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38536/console" [puppet] - 10https://gerrit.wikimedia.org/r/862256 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [14:33:32] Daimona: not sure if you have editinterface at metawiki. 
if not, i can put it there for you [14:33:57] (03PS1) 10Clément Goubert: mw-jobrunner: fix go templating [deployment-charts] - 10https://gerrit.wikimedia.org/r/862273 [14:34:18] (03PS1) 10Daimona Eaytoy: Create user group of beta testers of the CampaignEvents ext on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862274 (https://phabricator.wikimedia.org/T316227) [14:34:23] I don't think I do [14:34:27] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/862256 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [14:34:59] So, here's the new patch https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/862274 [14:35:16] The question would be who should be able to add people to that group [14:36:08] (03Abandoned) 10Daimona Eaytoy: Create list of users who can test the CampaignEvents extension on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859634 (https://phabricator.wikimedia.org/T316227) (owner: 10Daimona Eaytoy) [14:37:35] (03Merged) 10jenkins-bot: zotero: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860710 (owner: 10Giuseppe Lavagetto) [14:37:38] Also, it looks like the config change would be effective on beta-meta, too? [14:37:41] (03PS1) 10Hnowlan: cache: set api-gateway to normal [puppet] - 10https://gerrit.wikimedia.org/r/862276 [14:37:44] (03CR) 10Urbanecm: [C: 03+2] Create user group of beta testers of the CampaignEvents ext on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862274 (https://phabricator.wikimedia.org/T316227) (owner: 10Daimona Eaytoy) [14:38:28] Daimona: correct. 
for who is able to add users there, WMF T&S and Stewards are able to add/remove users to/from all groups, even if no group has that permission assigned [14:38:41] (03Merged) 10jenkins-bot: Create user group of beta testers of the CampaignEvents ext on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862274 (https://phabricator.wikimedia.org/T316227) (owner: 10Daimona Eaytoy) [14:38:54] OK, no-one being able to change group membership (except for T&S & stewards) seems good for now [14:39:05] But for beta, we would still want to grant that right to everyone [14:39:10] (03CR) 10Clément Goubert: [C: 03+2] mw-jobrunner: fix go templating [deployment-charts] - 10https://gerrit.wikimedia.org/r/862273 (owner: 10Clément Goubert) [14:39:28] Daimona: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/859635/ has merge conflict, can you rebase please? [14:39:48] (03PS1) 10Muehlenhoff: Add ganeti5004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/862278 [14:40:46] Daimona: ah, okay. then change InitialiseSettings-labs.php to grant the right to `user` at that wiki [14:40:54] fwiw your first patch would also affect beta [14:41:04] (03PS2) 10Daimona Eaytoy: Configure the CampaignEvents ext to use the x1.wikishared db for meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859635 (https://phabricator.wikimedia.org/T322745) [14:41:21] (03CR) 10Urbanecm: [C: 03+2] Configure the CampaignEvents ext to use the x1.wikishared db for meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859635 (https://phabricator.wikimedia.org/T322745) (owner: 10Daimona Eaytoy) [14:41:29] (03PS2) 10Urbanecm: Enable the CampaignEvents extension on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859636 (https://phabricator.wikimedia.org/T322745) (owner: 10Daimona Eaytoy) [14:41:33] You mean the one for the user group? 
I can fix that [14:41:38] (03CR) 10Urbanecm: [C: 03+2] Enable the CampaignEvents extension on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859636 (https://phabricator.wikimedia.org/T322745) (owner: 10Daimona Eaytoy) [14:41:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:41:44] Daimona: i already merged :D [14:42:11] (03Merged) 10jenkins-bot: Configure the CampaignEvents ext to use the x1.wikishared db for meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859635 (https://phabricator.wikimedia.org/T322745) (owner: 10Daimona Eaytoy) [14:42:24] (03Merged) 10jenkins-bot: Enable the CampaignEvents extension on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859636 (https://phabricator.wikimedia.org/T322745) (owner: 10Daimona Eaytoy) [14:42:39] i'll fetch this group of patches to mwdebug, so it can be tested [14:42:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:42:44] and beta can be fixed after [14:42:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:42:45] Whoops. 
[14:42:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862274 (https://phabricator.wikimedia.org/T316227) (owner: 10Daimona Eaytoy) [14:42:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859635 (https://phabricator.wikimedia.org/T322745) (owner: 10Daimona Eaytoy) [14:42:49] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/859636 (https://phabricator.wikimedia.org/T322745) (owner: 10Daimona Eaytoy) [14:42:58] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:862274|Create user group of beta testers of the CampaignEvents ext on meta (T316227)]], [[gerrit:859635|Configure the CampaignEvents ext to use the x1.wikishared db for meta (T322745)]], [[gerrit:859636|Enable the CampaignEvents extension on meta (T322745)]] [14:43:04] (03CR) 10Muehlenhoff: [C: 03+2] Add ganeti5004 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/862278 (owner: 10Muehlenhoff) [14:43:06] T322745: Enable CampaignEvents extension on Meta-wiki - https://phabricator.wikimedia.org/T322745 [14:43:06] T316227: Specify initial list of organizers via configuration - https://phabricator.wikimedia.org/T316227 [14:43:26] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 327 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:43:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:43:59] !log urbanecm@deploy1002 urbanecm and daimona: Backport for [[gerrit:862274|Create user group of beta testers of the CampaignEvents ext on meta (T316227)]], [[gerrit:859635|Configure the 
CampaignEvents ext to use the x1.wikishared db for meta (T322745)]], [[gerrit:859636|Enable the CampaignEvents extension on meta (T322745)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wm [14:44:00] net, mwdebug1002.eqiad.wmnet [14:44:10] Daimona: all's at mwdebug1001 now for testing [14:44:21] (still needs i18n though, if you can tell me how the group should be called, i can add that too now) [14:44:29] Wait, still checking a couple things [14:44:35] Should have put a WIP on my patch :) [14:44:42] (03Merged) 10jenkins-bot: mw-jobrunner: fix go templating [deployment-charts] - 10https://gerrit.wikimedia.org/r/862273 (owner: 10Clément Goubert) [14:44:54] Daimona: my apologies. i thought it's ready to go, since it was scheduled [14:45:05] Oh only the user group one I mean [14:45:27] ah [14:45:28] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:45:40] well, i can pull other patch to mwdebug if you tell me which one :) [14:45:48] this time i'll wait with hitting the buttons [14:45:50] (03CR) 10Muehlenhoff: [C: 03+1] "Patch looks fine (only thing missing is the approvals on T324101)" [puppet] - 10https://gerrit.wikimedia.org/r/862245 (https://phabricator.wikimedia.org/T324101) (owner: 10Matthias Mullie) [14:46:02] HouseOfM, cmelo: could you please check that things look good on meta (mwdebug1001) while I look at the user group thing for beta? 
[14:46:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q2:rack/setup/install cephosd100[1-5] - https://phabricator.wikimedia.org/T322760 (10Jclark-ctr) installed 6.4tb nvme into servers [14:46:14] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10MoritzMuehlenhoff) And in addition @odimitrijevic or @Ottomata [14:46:41] !log cgoubert@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [14:46:41] !log cgoubert@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [14:46:44] No worries, I only want to check things for beta. Let's see how things are looking in prod, but I think that part was ready [14:47:02] !log cgoubert@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [14:47:06] great. [14:47:06] Daimona yes , I will check it [14:47:12] !log cgoubert@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [14:47:12] Thanks! [14:47:42] (03CR) 10David Caro: harbor: remove unused harbor::db module/role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/860627 (https://phabricator.wikimedia.org/T267616) (owner: 10David Caro) [14:47:48] (03CR) 10Ammarpad: Add ContactPage and ArbCom form to EnWiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/860946 (https://phabricator.wikimedia.org/T321447) (owner: 10Wugapodes) [14:48:12] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:48:20] urbanecm: So apparently IS-labs.php has nothing for user groups that I can copy?! 
[14:48:39] Daimona: you need to add groupOverrides entry, similar to the one in IS.php [14:48:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:48:49] (if groupOverrides is not defined, you need to declare it too) [14:48:58] And it works the same, I guess? [14:48:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:49:09] yup [14:49:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:49:29] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:49:36] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:49:38] (03PS4) 10Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) [14:49:43] I could also do that in CS-labs.php if that's fine, I think it would be more self-contained [14:50:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:50:24] (03CR) 10CI reject: [V: 04-1] WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) (owner: 10Elukey) [14:50:35] Daimona: yeah, that's fine with me too. [14:51:16] cmelo: fwiw the group at meta is currently empty. I can add you to it if needed. 
[14:51:20] RECOVERY - Host elastic1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 5.09 ms [14:51:28] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48975 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:52:03] urbanecm yes please, thanks, my user name is CMelo (WMF) [14:52:08] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.292 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:52:18] RECOVERY - Host an-presto1006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 10.41 ms [14:53:06] (03PS1) 10Daimona Eaytoy: Give campaignevents-enable-registration to all users on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862280 (https://phabricator.wikimedia.org/T316227) [14:53:16] cmelo: okay, sec [14:53:17] Daimona I am testing it and it looks good. I can see all the special pages, but I was not able to create an event to test; I will test it as soon as urbanecm adds me to the group [14:53:34] urbanecm: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/862280 is the fix for beta [14:53:48] (03CR) 10CI reject: [V: 04-1] Give campaignevents-enable-registration to all users on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862280 (https://phabricator.wikimedia.org/T316227) (owner: 10Daimona Eaytoy) [14:53:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:54:14] RECOVERY - SSH on db1122.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:54:50] (03PS2) 10Daimona Eaytoy: Give campaignevents-enable-registration to all users on beta [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/862280 (https://phabricator.wikimedia.org/T316227) [14:55:32] cmelo: added [14:55:46] urbanecm thanks [14:55:49] Daimona: looking. i can merge it once prod's sorted [14:56:00] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862280 (https://phabricator.wikimedia.org/T316227) (owner: 10Daimona Eaytoy) [14:56:21] Perfect! I also quickly checked prod and it seems in order [14:56:22] (03CR) 10FNegri: [C: 03+1] harbor: remove unused harbor::db module/role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/860627 (https://phabricator.wikimedia.org/T267616) (owner: 10David Caro) [14:56:31] perfect. once cmelo confirms i can sync it. [14:56:38] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10Jclark-ctr) msw-e1-eqiad has a few bad ports moved Management to different ports an-presto1006 elastic1089 elastic1090 [14:57:03] (03CR) 10Lucas Werkmeister (WMDE): Add Property (120) to Wikidata content Namespace (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862247 (https://phabricator.wikimedia.org/T321282) (owner: 10Guergana Tzatchkova) [14:57:32] RECOVERY - Host elastic1090.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [14:58:24] Daimona seems everything is good, urbanecm could you please add these user names to the group: [14:58:25] https://meta.wikimedia.org/wiki/Campaigns/Foundation_Product_Team/Registration/V1_Summary#Organizer_testers [14:58:35] will do once synced :) [14:58:43] cmelo: Daimona: sounds great! so, i guess ok to sync now? [14:59:00] SGTM [14:59:09] doing [15:00:45] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10fgiunchedi) Thank you @Jclark-ctr ! FTR the task will be updated automatically once the alerts recover i.e. 
leaving only the hosts still alerting [15:00:58] (03PS5) 10JMeybohm: Rewrite as kubernetes operator/controller [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/861352 (https://phabricator.wikimedia.org/T323706) [15:01:00] (03PS5) 10JMeybohm: update vendor [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/861353 (https://phabricator.wikimedia.org/T323706) [15:01:11] jouncebot: next [15:01:11] In 3 hour(s) and 58 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221130T1900) [15:01:11] In 3 hour(s) and 58 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221130T1900) [15:02:24] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10Jclark-ctr) mw1376 moved to different port on msw [15:02:48] urbanecm thank you! [15:03:09] (03CR) 10JMeybohm: Rewrite as kubernetes operator/controller (035 comments) [software/helm-state-metrics] - 10https://gerrit.wikimedia.org/r/861352 (https://phabricator.wikimedia.org/T323706) (owner: 10JMeybohm) [15:03:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T323907)', diff saved to https://phabricator.wikimedia.org/P41886 and previous config saved to /var/cache/conftool/dbconfig/20221130-150320-ladsgroup.json [15:03:28] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [15:03:39] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:862274|Create user group of beta testers of the CampaignEvents ext on meta (T316227)]], [[gerrit:859635|Configure the CampaignEvents ext to use the x1.wikishared db for meta (T322745)]], [[gerrit:859636|Enable the CampaignEvents extension on meta (T322745)]] (duration: 20m 41s) [15:03:46] T322745: Enable CampaignEvents extension on Meta-wiki - https://phabricator.wikimedia.org/T322745 [15:03:47] T316227: Specify initial list of organizers via configuration - 
https://phabricator.wikimedia.org/T316227 [15:04:10] And, synced. [15:04:28] urbanecm great thanks! [15:04:47] thanks Daimona [15:04:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:05:31] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10Jclark-ctr) i am only having issues with one that will need to be rebooted labstore1004 [15:05:46] Cool, thank you. I know we're over time, but would it be possible to fix beta as well? Or better do that in the next window? [15:08:07] yeah, sure [15:08:21] (03PS3) 10Urbanecm: Give campaignevents-enable-registration to all users on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862280 (https://phabricator.wikimedia.org/T316227) (owner: 10Daimona Eaytoy) [15:08:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862280 (https://phabricator.wikimedia.org/T316227) (owner: 10Daimona Eaytoy) [15:08:29] doing beta [15:08:35] Thanks! [15:09:04] I've also seen roughly 100 PHP notices on meta happening quickly at around 15:00 UTC but that doesn't seem related. I've filed https://phabricator.wikimedia.org/T324119. 
[15:09:10] (03Merged) 10jenkins-bot: Give campaignevents-enable-registration to all users on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862280 (https://phabricator.wikimedia.org/T316227) (owner: 10Daimona Eaytoy) [15:09:55] (03PS15) 10Slyngshede: C:ldap::client::utils Rewrite add-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/860568 [15:09:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:10:24] thanks Daimona [15:10:27] beta's fixed [15:10:32] lemme do user groups now [15:10:38] Not working for me [15:11:00] Daimona: it takes ~30 minutes to be deployed [15:11:07] https://meta.wikimedia.org/wiki/Campaigns/Foundation_Product_Team/Registration/V1_Summary#Organizer_testers is the list of users to promote, right? [15:11:10] Ohhhh well [15:11:18] is there any particular comment i should make in the userrights changes? [15:11:24] (like a phab task linked or similar) [15:11:28] Yup, that's the list [15:11:55] Maybe just the permalink of that version and T316227 [15:11:55] T316227: Specify initial list of organizers via configuration - https://phabricator.wikimedia.org/T316227 [15:12:02] will do [15:12:11] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38537/console" [puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede) [15:15:15] cmelo: Daimona: i think i promoted all (see https://meta.wikimedia.org/wiki/special:Log/Martin_Urbanec/rights) [15:15:17] can you double-check? 
[15:15:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:15:53] (03PS16) 10Slyngshede: C:ldap::client::utils Rewrite add-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/860568 [15:16:29] Yup, LGTM [15:16:30] (03CR) 10CI reject: [V: 04-1] C:ldap::client::utils Rewrite add-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede) [15:16:39] great [15:16:46] so, i think we're done? [15:16:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:16:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:16:57] And beta is working now [15:17:06] I think we're done [15:17:06] (03PS1) 10Effie Mouzeli: osm: install imposm-deploy-import on servers [puppet] - 10https://gerrit.wikimedia.org/r/862281 [15:17:17] perfect [15:17:18] Only remaining thing would be i18n for the group [15:17:21] ah, yes [15:17:24] what should it be? [15:17:55] (03CR) 10Jgiannelos: [C: 03+1] osm: install imposm-deploy-import on servers [puppet] - 10https://gerrit.wikimedia.org/r/862281 (owner: 10Effie Mouzeli) [15:18:00] That's a good question [15:18:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P41888 and previous config saved to /var/cache/conftool/dbconfig/20221130-151827-ladsgroup.json [15:18:28] Can I talk about this with the team and get back to you later when I have an answer? 
[15:18:32] absolutely [15:18:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:19:24] (03PS1) 10Andrew Bogott: Specify pymysql driver for glance and barbican [puppet] - 10https://gerrit.wikimedia.org/r/862283 (https://phabricator.wikimedia.org/T323319) [15:19:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:20:47] Perfect, thank you! [15:21:26] Then I think we can call it a day. [15:21:30] (03PS17) 10Slyngshede: C:ldap::client::utils Rewrite add-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/860568 [15:21:34] great [15:21:59] Thank you again :) [15:22:08] happy to help [15:22:40] (03CR) 10Jgiannelos: osm: install imposm-deploy-import on servers [puppet] - 10https://gerrit.wikimedia.org/r/862281 (owner: 10Effie Mouzeli) [15:23:09] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38540/console" [puppet] - 10https://gerrit.wikimedia.org/r/860568 (owner: 10Slyngshede) [15:23:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:25:51] (03PS1) 10Andrew Bogott: designate: remove 'notify' flag; no longer supported [puppet] - 10https://gerrit.wikimedia.org/r/862286 (https://phabricator.wikimedia.org/T323319) [15:27:50] jouncebot: now [15:27:51] No deployments scheduled for the next 3 hour(s) and 32 minute(s) [15:28:13] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!!" 
[puppet] - 10https://gerrit.wikimedia.org/r/861871 (https://phabricator.wikimedia.org/T321410) (owner: 10Cwhite) [15:28:23] (03CR) 10MSantos: [C: 04-1] "As discussed with Yiannis in slack. This script has been moved to a step in the `imposm-initial-import` script. We should update docs and " [puppet] - 10https://gerrit.wikimedia.org/r/862281 (owner: 10Effie Mouzeli) [15:28:26] (03PS2) 10Effie Mouzeli: osm: remove imposm-deploy-import [puppet] - 10https://gerrit.wikimedia.org/r/862281 [15:28:55] (03CR) 10Andrew Bogott: [C: 03+2] Specify pymysql driver for glance and barbican [puppet] - 10https://gerrit.wikimedia.org/r/862283 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew Bogott) [15:29:34] (03CR) 10Andrew Bogott: [C: 03+2] designate: remove 'notify' flag; no longer supported [puppet] - 10https://gerrit.wikimedia.org/r/862286 (https://phabricator.wikimedia.org/T323319) (owner: 10Andrew Bogott) [15:30:15] Hi, I'm unable to submit edit on beta cluster [15:30:28] (created https://phabricator.wikimedia.org/T324120 [15:30:34] cirno: thanks for the report [15:30:38] Daimona: can that be related by any chance? [15:31:05] Uhmmmm weird [15:31:06] (03CR) 10Clément Goubert: [V: 03+1] "CampaignEvents has been enabled on production metawiki, so we can now merge this change" [puppet] - 10https://gerrit.wikimedia.org/r/861813 (https://phabricator.wikimedia.org/T320403) (owner: 10Clément Goubert) [15:31:21] (probably not, but it's the only recent beta change) [15:31:26] (03CR) 10Filippo Giunchedi: [C: 03+2] hiera: move thanos-web to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/862258 (https://phabricator.wikimedia.org/T323913) (owner: 10Filippo Giunchedi) [15:31:27] I can't think of anything obvious, but maybe? 
I feel like it's the 20th time I see this error for beta [15:33:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P41889 and previous config saved to /var/cache/conftool/dbconfig/20221130-153333-ladsgroup.json [15:33:58] !log jiji@maps1009 imposm-removebackup-import - T314472 [15:34:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:05] T314472: Re-import full planet data into eqiad and codfw - https://phabricator.wikimedia.org/T314472 [15:34:32] (03PS3) 10Ladsgroup: mediawiki: Add quarterly cleanup of flaggedtemplates table [puppet] - 10https://gerrit.wikimedia.org/r/861507 (https://phabricator.wikimedia.org/T290769) [15:34:38] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mediawiki: Add quarterly cleanup of flaggedtemplates table [puppet] - 10https://gerrit.wikimedia.org/r/861507 (https://phabricator.wikimedia.org/T290769) (owner: 10Ladsgroup) [15:35:12] !log roll-restart pybal on lvs[21]020 to pick up thanos-web service and then on lvs1019 lvs2009 - T323913 [15:35:18] PROBLEM - PyBal connections to etcd on lvs2009 is CRITICAL: CRITICAL: 69 connections established with conf2004.codfw.wmnet:4001 (min=70) https://wikitech.wikimedia.org/wiki/PyBal [15:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:19] T323913: Move thanos-sso away from CNAME discovery.wmnet - https://phabricator.wikimedia.org/T323913 [15:36:35] jouncebot nowandnext [15:36:35] No deployments scheduled for the next 3 hour(s) and 23 minute(s) [15:36:35] In 3 hour(s) and 23 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221130T1900) [15:36:35] In 3 hour(s) and 23 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221130T1900) [15:36:52] Attempting to get wmf.12 to testwikis [15:37:15] fingers crossed dancy :) [15:37:38] PROBLEM - PyBal 
IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.77:443]) https://wikitech.wikimedia.org/wiki/PyBal [15:38:22] PROBLEM - graphite.wikimedia.org api on graphite1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 2797 bytes in 3.068 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [15:38:56] the pybal ipvs diff is expected [15:39:00] should resolve shortly [15:39:22] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 74 connections established with conf1007.eqiad.wmnet:4001 (min=75) https://wikitech.wikimedia.org/wiki/PyBal [15:39:30] RECOVERY - graphite.wikimedia.org api on graphite1004 is OK: HTTP OK: HTTP/1.1 200 OK - 311 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting [15:40:08] godog: https://giphy.com/gifs/this-is-fine-QMHoU66sBXqqLqYvGO [15:40:09] urbanecm: For the group messages, you can use "CampaignEvents beta testers" for the group itself, and "CampaignEvents beta tester" for -member [15:40:21] Does that work for you? [15:40:21] perfect [15:40:26] RECOVERY - PyBal connections to etcd on lvs2009 is OK: OK: 70 connections established with conf2004.codfw.wmnet:4001 (min=70) https://wikitech.wikimedia.org/wiki/PyBal [15:40:31] (03PS1) 10Andrew Bogott: catch up to the recent keystone_authtoken refactor [puppet] - 10https://gerrit.wikimedia.org/r/862288 [15:40:33] lolz elukey [15:40:51] lol [15:41:20] (03PS2) 10Andrew Bogott: profile::openstack::base::cinder::backup: more keystone_authtoken refactor [puppet] - 10https://gerrit.wikimedia.org/r/862288 [15:41:42] Daimona: done. should we set grouppage- to something? [15:41:51] (03PS5) 10Elukey: WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) [15:41:53] currently group name's linkless, cf. 
https://meta.wikimedia.org/wiki/Special:UserRights/CMelo_(WMF) [15:42:06] Amazing, thank you again :) [15:42:15] Let me ask about that [15:42:18] sure [15:42:24] (03PS1) 10TrainBranchBot: testwikis wikis to 1.40.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862290 (https://phabricator.wikimedia.org/T320517) [15:42:26] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.40.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862290 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) [15:42:45] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10phaultfinder) [15:42:47] (03CR) 10CI reject: [V: 04-1] WIP - Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) (owner: 10Elukey) [15:42:58] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:43:08] (03Merged) 10jenkins-bot: testwikis wikis to 1.40.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862290 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) [15:43:26] (03CR) 10Andrew Bogott: [C: 03+2] profile::openstack::base::cinder::backup: more keystone_authtoken refactor [puppet] - 10https://gerrit.wikimedia.org/r/862288 (owner: 10Andrew Bogott) [15:43:31] !log dancy@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.12 refs T320517 [15:43:38] T320517: 1.40.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T320517 [15:44:43] found something even stranger, I could not logout from beta cluster: T324121 [15:44:43] T324121: Unable to logout from beta cluster - https://phabricator.wikimedia.org/T324121 [15:44:44] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 75 connections established with conf1007.eqiad.wmnet:4001 (min=75) https://wikitech.wikimedia.org/wiki/PyBal [15:45:11] urbanecm: Could it be a link to 
https://meta.wikimedia.org/wiki/Campaigns/Foundation_Product_Team/Registration/V1_Summary#Organizer_testers ? [15:45:21] i think so [15:45:43] (03CR) 10Filippo Giunchedi: [C: 03+1] install_server: set eqiad bullseye vms to install bullseye [puppet] - 10https://gerrit.wikimedia.org/r/861871 (https://phabricator.wikimedia.org/T321410) (owner: 10Cwhite) [15:45:48] done [15:47:59] (03CR) 10Elukey: "Hugh I have a question - do we get a default caching behavior (like cache for X hours etc..) if we move away from pass? I recall something" [puppet] - 10https://gerrit.wikimedia.org/r/862276 (owner: 10Hnowlan) [15:48:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T323907)', diff saved to https://phabricator.wikimedia.org/P41890 and previous config saved to /var/cache/conftool/dbconfig/20221130-154840-ladsgroup.json [15:48:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2126.codfw.wmnet with reason: Maintenance [15:48:48] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [15:48:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2126.codfw.wmnet with reason: Maintenance [15:48:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [15:48:59] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:49:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance [15:49:16] Amazing, and at the cost of being repetitive: thank you very much :) [15:49:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T323907)', diff saved to 
https://phabricator.wikimedia.org/P41891 and previous config saved to /var/cache/conftool/dbconfig/20221130-154917-ladsgroup.json [15:49:46] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860712 (owner: 10Giuseppe Lavagetto) [15:49:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:49:53] Daimona: no problem. good luck with the project :) [15:51:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [15:51:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:52:06] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:54:06] !log akosiaris@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on thumbor2004.codfw.wmnet with reason: work on iDrac [15:54:07] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on ores2009.codfw.wmnet with reason: DCOps maintenance [15:54:16] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10Ottomata) > Reason for access: need query search usage via jupyter for Structured Data pipelines I'm not sure if analytics-platform-eng-admins is the... 
[15:54:20] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on ores2009.codfw.wmnet with reason: DCOps maintenance [15:54:21] !log akosiaris@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on thumbor2004.codfw.wmnet with reason: work on iDrac [15:54:22] (03Merged) 10jenkins-bot: mediawiki: convert to modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/860712 (owner: 10Giuseppe Lavagetto) [15:54:27] 10SRE, 10ops-codfw, 10serviceops: codfw: ManagementSSHDown for ores2009 and thumbor2004 - https://phabricator.wikimedia.org/T323925 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=8b8e8a4d-71f2-462d-8e1f-ff904f7e3ed4) set by akosiaris@cumin1001 for 1:00:00 on 1 host(s) and their services... [15:56:00] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10MarkTraceur) Approved! [15:57:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [15:58:06] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Request for access to analytics-platform-eng-admins for mlitn - https://phabricator.wikimedia.org/T324101 (10Marostegui) I will hold until the group needed is sorted :) [15:58:28] 10SRE, 10ops-codfw, 10serviceops: codfw: ManagementSSHDown for ores2009 and thumbor2004 - https://phabricator.wikimedia.org/T323925 (10klausman) ores2009 is shutting down & powering off now [15:58:49] (03PS1) 10David Caro: quota_increase: Fix issue with dashed quota names [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862294 [15:58:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:59:14] <_joe_> sigh I forgot [16:02:21] 
(03PS1) 10Btullis: Use parquet column names to order results from hive in presto [puppet] - 10https://gerrit.wikimedia.org/r/862295 (https://phabricator.wikimedia.org/T321960) [16:02:23] _joe_ ? [16:02:55] <_joe_> claime: that this is the time of a backport window [16:02:58] (03CR) 10CI reject: [V: 04-1] quota_increase: Fix issue with dashed quota names [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/862294 (owner: 10David Caro) [16:03:07] <_joe_> so I just had the deployment to mwdebug failing [16:03:19] oh right [16:03:22] RECOVERY - DNS on mw1376.mgmt is OK: DNS OK: 0.010 seconds response time. mw1376.mgmt.eqiad.wmnet returns 10.65.2.135 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:03:29] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [16:05:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T323907)', diff saved to https://phabricator.wikimedia.org/P41892 and previous config saved to /var/cache/conftool/dbconfig/20221130-160527-ladsgroup.json [16:05:37] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [16:07:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [16:09:54] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [16:09:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:10:14] PROBLEM - Host ores2003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:11:46] (03CR) 10Ottomata: [C: 03+1] "Nice" [puppet] - 10https://gerrit.wikimedia.org/r/862295 (https://phabricator.wikimedia.org/T321960) (owner: 10Btullis) [16:13:13] (03CR) 10Btullis: [C: 03+2] Use parquet column names to order results from hive in presto 
[puppet] - 10https://gerrit.wikimedia.org/r/862295 (https://phabricator.wikimedia.org/T321960) (owner: 10Btullis) [16:13:31] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [16:17:23] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [16:17:53] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [16:18:15] !log oblivian@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [16:18:15] !log oblivian@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [16:19:23] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [16:19:59] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [16:20:21] !log btullis@cumin1001 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [16:20:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P41895 and previous config saved to /var/cache/conftool/dbconfig/20221130-162034-ladsgroup.json [16:20:49] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [16:20:50] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [16:20:52] !log oblivian@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [16:21:25] !log oblivian@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [16:21:34] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [16:22:04] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [16:22:18] !log oblivian@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [16:22:18] !log oblivian@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : 
sync [16:22:34] !log oblivian@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [16:22:36] !log oblivian@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [16:22:43] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [16:23:07] (03PS2) 10Eevans: echostore: bump container version to v1.0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861925 (https://phabricator.wikimedia.org/T253244) [16:23:25] !log dancy@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.12 refs T320517 (duration: 39m 53s) [16:23:31] T320517: 1.40.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T320517 [16:24:18] PROBLEM - Host ores2009 is DOWN: PING CRITICAL - Packet loss = 100% [16:24:55] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [16:25:03] (03CR) 10Eevans: [C: 03+2] echostore: bump container version to v1.0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861925 (https://phabricator.wikimedia.org/T253244) (owner: 10Eevans) [16:25:03] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [16:25:39] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [16:25:49] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [16:25:54] !log dancy@deploy1002 Pruned MediaWiki: 1.40.0-wmf.8 (duration: 02m 26s) [16:26:10] Rolling wmf.12 to group0 [16:27:59] (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862299 (https://phabricator.wikimedia.org/T320517) [16:28:01] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862299 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) [16:28:46] (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.12 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/862299 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) [16:28:59] (03CR) 10MVernon: [C: 03+1] swift: move ms-be2050 to new naming schema (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [16:29:37] (03Merged) 10jenkins-bot: echostore: bump container version to v1.0.10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861925 (https://phabricator.wikimedia.org/T253244) (owner: 10Eevans) [16:31:00] !log btullis@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [16:31:48] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:32:22] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:33:31] XioNoX, topranks: expected? ^^^ related to eqsin refresh? [16:33:32] sukhe: ^^ is that you? [16:33:40] looking [16:33:47] that's an anycast host [16:34:29] not me [16:34:50] !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/echostore: apply [16:34:53] no current reimaging work from Traffic side [16:34:59] all the BGP sessions bounced [16:35:00] RECOVERY - Host ores2003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.62 ms [16:35:06] they're all back except 1 v6 [16:35:08] oh? 
[16:35:19] (ProbeDown) firing: Service upload-https:443 has failed probes (http_upload-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:35:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P41896 and previous config saved to /var/cache/conftool/dbconfig/20221130-163540-ladsgroup.json [16:36:48] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/echostore: apply [16:37:23] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.12 refs T320517 [16:37:29] T320517: 1.40.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T320517 [16:38:30] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: OpenSent - HE, AS6939/IPv4: OpenConfirm - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:40:19] (ProbeDown) resolved: Service upload-https:443 has failed probes (http_upload-https_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:41:26] (03PS1) 10Btullis: Update the presto server catalogs with new parquet settings [puppet] - 10https://gerrit.wikimedia.org/r/862305 (https://phabricator.wikimedia.org/T321960) [16:42:20] (03CR) 10Btullis: "I forgot to do this in the previous commit." 
[puppet] - 10https://gerrit.wikimedia.org/r/862305 (https://phabricator.wikimedia.org/T321960) (owner: 10Btullis) [16:42:34] (03CR) 10Btullis: [C: 03+2] Update the presto server catalogs with new parquet settings [puppet] - 10https://gerrit.wikimedia.org/r/862305 (https://phabricator.wikimedia.org/T321960) (owner: 10Btullis) [16:43:58] !log eevans@deploy1002 helmfile [codfw] START helmfile.d/services/echostore: apply [16:44:42] !log eevans@deploy1002 helmfile [codfw] DONE helmfile.d/services/echostore: apply [16:44:42] !log btullis@cumin1001 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [16:50:08] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 281 probes of 708 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:50:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T323907)', diff saved to https://phabricator.wikimedia.org/P41897 and previous config saved to /var/cache/conftool/dbconfig/20221130-165047-ladsgroup.json [16:50:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [16:50:55] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [16:51:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [16:51:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T323907)', diff saved to https://phabricator.wikimedia.org/P41898 and previous config saved to /var/cache/conftool/dbconfig/20221130-165108-ladsgroup.json [16:52:03] !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/echostore: apply [16:52:36] !log eevans@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/echostore: apply [16:54:18] PROBLEM - Host thumbor2004 is DOWN: PING CRITICAL - Packet loss = 100% [16:55:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons. [16:56:14] jouncebot nowandnext [16:56:14] No deployments scheduled for the next 2 hour(s) and 3 minute(s) [16:56:14] In 2 hour(s) and 3 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221130T1900) [16:56:15] In 2 hour(s) and 3 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221130T1900) [16:56:59] Running one more scap sync-world to collect data [16:57:36] !log dancy@deploy1002 Started scap: testing k8s deploy [16:57:58] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (10phaultfinder) [16:59:44] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:Damilare Adedoyin - https://phabricator.wikimedia.org/T324058 (10Damilare) [17:00:34] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) is CRITICAL: Test retrieve a random article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [17:00:39] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:Damilare Adedoyin - https://phabricator.wikimedia.org/T324058 (10Damilare) Apologies @Marostegui I've updated the description with the required template. 
[17:00:50] RECOVERY - Host thumbor2004 is UP: PING OK - Packet loss = 0%, RTA = 31.68 ms [17:01:38] RECOVERY - Host thumbor2004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.76 ms [17:02:22] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 38 probes of 708 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:02:38] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [17:04:12] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:Damilare Adedoyin - https://phabricator.wikimedia.org/T324058 (10Marostegui) Thanks @Damilare - which access do you need? https://wikitech.wikimedia.org/wiki/Analytics/Data_access#What_access_should_I_request? You just need Turnilo without acc... [17:04:24] (03PS1) 10Eevans: echostore: bring codfw hosts up to date [deployment-charts] - 10https://gerrit.wikimedia.org/r/862307 (https://phabricator.wikimedia.org/T253244) [17:05:35] !log dancy@deploy1002 Finished scap: testing k8s deploy (duration: 07m 59s) [17:09:19] (03PS5) 10Klausman: API GW: add config for addtional LW inference services [deployment-charts] - 10https://gerrit.wikimedia.org/r/861401 (https://phabricator.wikimedia.org/T323916) [17:09:48] Done testing [17:17:37] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:Damilare Adedoyin - https://phabricator.wikimedia.org/T324058 (10XenoRyet) I'm Damilare's manager, and I approve. [17:18:19] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:Damilare Adedoyin - https://phabricator.wikimedia.org/T324058 (10greg) Hi @Marostegui! Dami (and @AnnWF if you see a task from her) will need PII access as well, so "Dashboards in Superset / Hive interfaces (like Hue) that do access private d... 
[17:23:27] (03CR) 10Hnowlan: [C: 03+2] API GW: add config for addtional LW inference services [deployment-charts] - 10https://gerrit.wikimedia.org/r/861401 (https://phabricator.wikimedia.org/T323916) (owner: 10Klausman) [17:25:36] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo for USER:Damilare Adedoyin - https://phabricator.wikimedia.org/T324058 (10Ottomata) Sounds like analytics-privatedata-users group membership without ssh and kerberos. Approved. [17:26:51] (03CR) 10Ssingh: [C: 03+2] cp5022: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/861910 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [17:27:19] (03PS2) 10Ssingh: cp5022: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/861910 (https://phabricator.wikimedia.org/T322048) [17:28:57] (03Merged) 10jenkins-bot: API GW: add config for addtional LW inference services [deployment-charts] - 10https://gerrit.wikimedia.org/r/861401 (https://phabricator.wikimedia.org/T323916) (owner: 10Klausman) [17:29:01] 10SRE, 10ops-codfw, 10serviceops: codfw: ManagementSSHDown for ores2009 and thumbor2004 - https://phabricator.wikimedia.org/T323925 (10Papaul) thumbor2004 had a broken iDRAC card. I replaced it. 
[17:29:40] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/api-gateway: sync [17:29:58] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/api-gateway: sync [17:31:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T323907)', diff saved to https://phabricator.wikimedia.org/P41899 and previous config saved to /var/cache/conftool/dbconfig/20221130-173103-ladsgroup.json [17:31:10] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [17:31:59] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp5022.eqsin.wmnet with OS buster [17:32:11] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host cp5022.eqsin.wmnet with OS buster [17:37:03] (ProbeDown) firing: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:39:04] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/api-gateway: sync [17:39:35] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/api-gateway: sync [17:40:36] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/api-gateway: sync [17:41:01] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: sync [17:42:03] (ProbeDown) resolved: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:45:48] 
PROBLEM - Check systemd state on kubernetes1012 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:46:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P41900 and previous config saved to /var/cache/conftool/dbconfig/20221130-174609-ladsgroup.json [17:49:33] (ProbeDown) firing: (3) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:49:48] (ProbeDown) resolved: Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:52:30] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1012 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [17:57:50] (03CR) 10Btullis: [C: 03+1] Set role_contacts for role analytics_cluster::coordinator::replica [puppet] - 10https://gerrit.wikimedia.org/r/860608 (owner: 10Muehlenhoff) [18:00:02] 10SRE, 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Kubernetes: eventstreams cannot be deployed and its deployments will need to be destroyed and recreated - https://phabricator.wikimedia.org/T324074 (10BTullis) 05Open→03Resolved >>! In T324074#8431657, @Ottomata wrote: > Thank you bot... 
[18:01:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P41901 and previous config saved to /var/cache/conftool/dbconfig/20221130-180116-ladsgroup.json [18:01:17] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5022.eqsin.wmnet with reason: host reimage [18:04:40] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5022.eqsin.wmnet with reason: host reimage [18:06:47] (03PS1) 10Klausman: APIGW/Liftwing: Fix missing part of path regexen [deployment-charts] - 10https://gerrit.wikimedia.org/r/862311 (https://phabricator.wikimedia.org/T323916) [18:07:43] (03CR) 10Klausman: "./." [deployment-charts] - 10https://gerrit.wikimedia.org/r/861401 (https://phabricator.wikimedia.org/T323916) (owner: 10Klausman) [18:07:47] RECOVERY - Host ores2009.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.58 ms [18:08:13] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:09:33] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2173.mgmt.codfw.wmnet with reboot policy FORCED [18:10:46] rzl: FYI ^^^ this time foundation.wikimedia.org/wiki/Home timed out [18:11:19] (03CR) 10Btullis: [C: 03+1] "Looks good to me. I'm happy to approve this and merge tomorrow." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/860518 (owner: 10Giuseppe Lavagetto) [18:13:41] RECOVERY - Check systemd state on kubernetes1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:15:44] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2173.mgmt.codfw.wmnet with reboot policy FORCED [18:16:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T323907)', diff saved to https://phabricator.wikimedia.org/P41902 and previous config saved to /var/cache/conftool/dbconfig/20221130-181623-ladsgroup.json [18:16:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [18:16:30] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [18:16:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [18:16:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T323907)', diff saved to https://phabricator.wikimedia.org/P41903 and previous config saved to /var/cache/conftool/dbconfig/20221130-181644-ladsgroup.json [18:20:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2173.mgmt.codfw.wmnet with reboot policy FORCED [18:21:23] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:23:11] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1012 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [18:23:33] RECOVERY - SSH on mw1326.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook 
[18:24:58] volans: weird, thanks [18:25:26] np [18:26:05] that one's a very quick page, so something legit probably went wrong [18:26:41] but if we don't care about it unless it happens a couple times in a row, that's another good argument for building that logic in [18:27:58] (03PS4) 10Raymond Ndibe: cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repo [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) [18:28:22] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2173.mgmt.codfw.wmnet with reboot policy FORCED [18:28:39] RECOVERY - Host ores2009 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [18:28:44] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db2173.mgmt.codfw.wmnet with reboot policy FORCED [18:28:47] PROBLEM - ores_workers_running on ores2009 is CRITICAL: PROCS CRITICAL: 1 process with command name celery https://wikitech.wikimedia.org/wiki/ORES [18:29:11] PROBLEM - ores on ores2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [18:30:15] RECOVERY - ores_workers_running on ores2009 is OK: PROCS OK: 89 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [18:30:27] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2173.mgmt.codfw.wmnet with reboot policy FORCED [18:30:31] RECOVERY - ores on ores2009 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [18:30:45] (03CR) 10CI reject: [V: 04-1] cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repo [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe) [18:30:59] !log pt1979@cumin2002 START - Cookbook 
sre.hosts.provision for host db2173.mgmt.codfw.wmnet with reboot policy FORCED [18:31:42] (03PS5) 10Raymond Ndibe: cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repo [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) [18:34:13] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5022.eqsin.wmnet with OS buster [18:34:23] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host cp5022.eqsin.wmnet with OS buster completed: - cp5022 (**PASS**) -... [18:34:25] (03CR) 10jenkins-bot: cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repo [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe) [18:35:11] (03PS2) 10Ssingh: cp5023: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/861911 (https://phabricator.wikimedia.org/T322048) [18:36:32] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ssingh) [18:37:27] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2173.mgmt.codfw.wmnet with reboot policy FORCED [18:38:21] (03CR) 10Ottomata: "I am glad we recently converted to serviceops templates becuz otherwise I guess this would be harder to do!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/860518 (owner: 10Giuseppe Lavagetto) [18:39:13] PROBLEM - SSH on mw1327.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:40:05] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 207 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:41:03] 10SRE, 10ops-codfw, 10serviceops: codfw: ManagementSSHDown for ores2009 and thumbor2004 - https://phabricator.wikimedia.org/T323925 (10Papaul) 05Open→03Resolved ores2009 mgmt is back up [18:41:25] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 3 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:42:40] (03PS6) 10Raymond Ndibe: cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repository [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) [18:42:41] 10SRE, 10ops-codfw, 10DBA: db2173 HW errors - https://phabricator.wikimedia.org/T322988 (10Papaul) 05Open→03Resolved @Marostegui main board replaced. The server is back up and running. Sorry it took this long to get this fixed. Thanks [18:42:53] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T323960 (10Papaul) 05Open→03Resolved a:03Papaul This is fixed. 
[18:45:31] (03CR) 10CI reject: [V: 04-1] cookbooks: print out instructions on next step after updating the buildpack/tekton images in the local repository [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe) [18:49:59] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:51:39] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:52:39] 10SRE, 10ops-codfw: Degraded RAID on ganeti2013 - https://phabricator.wikimedia.org/T323222 (10Papaul) a:05Papaul→03MoritzMuehlenhoff @MoritzMuehlenhoff Disk replaced [18:52:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T323907)', diff saved to https://phabricator.wikimedia.org/P41904 and previous config saved to /var/cache/conftool/dbconfig/20221130-185254-ladsgroup.json [18:53:03] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [18:56:55] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:58:10] !log robh@cumin2002 START - Cookbook sre.hosts.provision for host lvs5004.mgmt.eqsin.wmnet with reboot policy FORCED [18:58:13] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:00:05] dancy and brennen: #bothumor When your hammer is PHP, everything starts looking like a 
thumb. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221130T1900). [19:00:05] dancy and brennen: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221130T1900). [19:00:37] o/ just got to coffeeshop wifi, so i'm about for a few minutes [19:05:23] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: codfw VM for VRTS - https://phabricator.wikimedia.org/T324030 (10Arnoldokoth) ` Ready to create Ganeti VM vrts2001.codfw.wmnet in the codfw cluster on group C with 4 vCPUs, 8GB of RAM, 25GB of disk in the private network. ` [19:05:23] !log aokoth@cumin1001 START - Cookbook sre.ganeti.makevm for new host vrts2001.codfw.wmnet [19:05:24] !log aokoth@cumin1001 START - Cookbook sre.dns.netbox [19:06:26] (03PS1) 10Ssingh: hiera: decommission dns5001 [puppet] - 10https://gerrit.wikimedia.org/r/862316 (https://phabricator.wikimedia.org/T323830) [19:08:02] (03CR) 10Raymond Ndibe: "No idea why this is failing" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/859582 (https://phabricator.wikimedia.org/T321188) (owner: 10Raymond Ndibe) [19:08:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P41905 and previous config saved to /var/cache/conftool/dbconfig/20221130-190801-ladsgroup.json [19:08:34] !log aokoth@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM vrts2001.codfw.wmnet - aokoth@cumin1001" [19:09:09] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) [19:09:32] 10SRE, 10ops-codfw, 10decommission-hardware, 10User-fgiunchedi: decommission graphite2003.codfw.wmnet - https://phabricator.wikimedia.org/T323718 (10Papaul) 
[19:10:45] (03PS1) 10Ssingh: ntp/eqsin: move to dns5002 [dns] - 10https://gerrit.wikimedia.org/r/862318 (https://phabricator.wikimedia.org/T323830) [19:11:39] sukhe: we got a conflict in netbox data sync or so [19:11:41] Pressing the train button. [19:12:19] mutante: hi [19:12:21] from which change? [19:12:41] * brennen eyeballs logspam-watch [19:13:26] sukhe: your change: cp50* our change: vrts2001 unknown change: db1205 ... ? [19:13:31] sigh [19:13:38] oh? [19:13:46] I didn't get any prompt during the cookbook run [19:13:54] cp5022? or 21 and 22? [19:14:07] 22. [19:14:14] please merge them [19:14:17] no concern there [19:14:27] what about the db servers.. is this scary? [19:14:32] there is an active db server in there [19:14:35] that we have nothing to do with [19:14:55] papaul: ^ [19:15:12] db1204 and db1205 [19:15:13] seems like you were working on db1205? [19:15:18] and 1204 yes [19:15:35] https://phabricator.wikimedia.org/T313978#8429443 [19:15:43] both are pass [19:15:46] 3-way-merge [19:15:49] ha [19:16:06] I think given it says task is completed, you should go ahead and merge all [19:16:11] since vrts is your and arnoldokoth? [19:16:29] yes, that is us [19:16:30] sukhe: yes that is done [19:16:34] !log aokoth@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM vrts2001.codfw.wmnet - aokoth@cumin1001" [19:16:34] !log aokoth@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:16:34] !log aokoth@cumin1001 START - Cookbook sre.dns.wipe-cache vrts2001.codfw.wmnet on all recursors [19:16:37] !log aokoth@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) vrts2001.codfw.wmnet on all recursors [19:16:38] there we go [19:16:43] thanks sukhe [19:16:48] cool, thanks all! 
[19:17:13] robh: looks like a network mgmt cable to me [19:19:36] (03PS1) 10Ladsgroup: Fix PHP notice [extensions/GlobalBlocking] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861484 (https://phabricator.wikimedia.org/T324119) [19:20:15] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10cmooney) On the back of the meeting earlier and our discussion around Ceph I decided to look a little bit closer into the heartbeat... [19:20:46] (03PS1) 10Ssingh: sites.yaml: remove dns5001 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/862321 (https://phabricator.wikimedia.org/T323830) [19:20:50] Since it looks like people are working on it, I'm going to hold train until T324119 is resolved. [19:20:51] T324119: PHP Notice: Trying to access array offset on value of type null - https://phabricator.wikimedia.org/T324119 [19:21:35] !log robh@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host lvs5004.mgmt.eqsin.wmnet with reboot policy FORCED [19:22:21] dancy: do you want me to deploy T324119? [19:22:28] Yes please! 
[19:23:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P41906 and previous config saved to /var/cache/conftool/dbconfig/20221130-192308-ladsgroup.json [19:23:27] jouncebot: nowandnext [19:23:27] For the next 0 hour(s) and 36 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221130T1900) [19:23:27] For the next 1 hour(s) and 36 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221130T1900) [19:23:27] In 1 hour(s) and 36 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221130T2100) [19:23:57] (03CR) 10Ladsgroup: [C: 03+2] Fix PHP notice [extensions/GlobalBlocking] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861484 (https://phabricator.wikimedia.org/T324119) (owner: 10Ladsgroup) [19:26:08] (03Merged) 10jenkins-bot: Fix PHP notice [extensions/GlobalBlocking] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861484 (https://phabricator.wikimedia.org/T324119) (owner: 10Ladsgroup) [19:26:33] Amir1: wmf.10 incoming ? [19:26:56] dancy: is it worth fixing? [19:27:17] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 151 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:27:20] if it's broken for two weeks, It can stay broken for a more day [19:27:24] Yes please. 
It is near the top of my logspam indicator [19:27:32] okay [19:27:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/GlobalBlocking] (wmf/1.40.0-wmf.12) - 10https://gerrit.wikimedia.org/r/861484 (https://phabricator.wikimedia.org/T324119) (owner: 10Ladsgroup) [19:27:47] (03PS1) 10Ladsgroup: Fix PHP notice [extensions/GlobalBlocking] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/861485 (https://phabricator.wikimedia.org/T324119) [19:27:53] (03CR) 10Ladsgroup: [C: 03+2] Fix PHP notice [extensions/GlobalBlocking] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/861485 (https://phabricator.wikimedia.org/T324119) (owner: 10Ladsgroup) [19:27:57] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:861484|Fix PHP notice (T324119)]] [19:28:04] T324119: PHP Notice: Trying to access array offset on value of type null - https://phabricator.wikimedia.org/T324119 [19:28:36] !log aokoth@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host vrts2001.codfw.wmnet [19:28:41] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL: Test retrieve the most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 503 (expecting: 200): /{domain}/v1/page/news (get In the News content) is CRITICAL: Test get In the News content returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds [19:29:05] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:861484|Fix PHP notice (T324119)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [19:29:17] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:29:29] (03Merged) 10jenkins-bot: Fix PHP notice [extensions/GlobalBlocking] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/861485 (https://phabricator.wikimedia.org/T324119) (owner: 10Ladsgroup) [19:29:32] (03CR) 10BCornwall: [C: 03+1] cp5023: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/861911 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [19:30:06] (03PS1) 10Andrew Bogott: OpenStack: use rabbitmq Quorum queues [puppet] - 10https://gerrit.wikimedia.org/r/862323 (https://phabricator.wikimedia.org/T318816) [19:30:08] (03CR) 10BBlack: [C: 03+1] hiera: decommission dns5001 [puppet] - 10https://gerrit.wikimedia.org/r/862316 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [19:30:17] (03CR) 10BBlack: [C: 03+1] sites.yaml: remove dns5001 from anycast_neighbors [homer/public] - 10https://gerrit.wikimedia.org/r/862321 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [19:30:41] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [19:30:47] (03CR) 10BBlack: [C: 03+1] ntp/eqsin: move to dns5002 [dns] - 10https://gerrit.wikimedia.org/r/862318 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [19:31:20] (03PS2) 10Andrew Bogott: OpenStack: use rabbitmq Quorum queues [puppet] - 10https://gerrit.wikimedia.org/r/862323 (https://phabricator.wikimedia.org/T318816) [19:32:28] (03PS3) 10Andrew Bogott: OpenStack: use rabbitmq Quorum queues [puppet] - 10https://gerrit.wikimedia.org/r/862323 (https://phabricator.wikimedia.org/T318816) [19:33:22] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack: use rabbitmq Quorum queues [puppet] - 10https://gerrit.wikimedia.org/r/862323 (https://phabricator.wikimedia.org/T318816) (owner: 
10Andrew Bogott) [19:33:48] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:861484|Fix PHP notice (T324119)]] (duration: 05m 50s) [19:33:55] T324119: PHP Notice: Trying to access array offset on value of type null - https://phabricator.wikimedia.org/T324119 [19:34:03] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/GlobalBlocking] (wmf/1.40.0-wmf.10) - 10https://gerrit.wikimedia.org/r/861485 (https://phabricator.wikimedia.org/T324119) (owner: 10Ladsgroup) [19:34:26] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:861485|Fix PHP notice (T324119)]] [19:35:32] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:861485|Fix PHP notice (T324119)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [19:36:41] (03CR) 10BCornwall: [C: 03+2] cp5023: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/861911 (https://phabricator.wikimedia.org/T322048) (owner: 10Ssingh) [19:37:01] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: codfw VM for VRTS - https://phabricator.wikimedia.org/T324030 (10Arnoldokoth) [19:37:11] (03CR) 10Ssingh: [C: 03+2] ntp/eqsin: move to dns5002 [dns] - 10https://gerrit.wikimedia.org/r/862318 (https://phabricator.wikimedia.org/T323830) (owner: 10Ssingh) [19:37:45] !log: running authdns-update for Gerrit: 862318 (T323830) [19:37:46] T323830: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 [19:37:58] !log running authdns-update for Gerrit: 862318 (T323830) [19:38:01] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: codfw VM for VRTS - https://phabricator.wikimedia.org/T324030 (10Arnoldokoth) 05Open→03In progress p:05Triage→03High [19:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:15] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db2148 (T323907)', diff saved to https://phabricator.wikimedia.org/P41907 and previous config saved to /var/cache/conftool/dbconfig/20221130-193814-ladsgroup.json [19:38:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [19:38:21] !log brett@cumin1001 START - Cookbook sre.hosts.reimage for host cp5023.eqsin.wmnet with OS buster [19:38:21] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [19:38:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2170.codfw.wmnet with reason: Maintenance [19:38:34] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host cp5023.eqsin.wmnet with OS buster [19:38:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T323907)', diff saved to https://phabricator.wikimedia.org/P41908 and previous config saved to /var/cache/conftool/dbconfig/20221130-193836-ladsgroup.json [19:39:58] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:861485|Fix PHP notice (T324119)]] (duration: 05m 32s) [19:40:05] T324119: PHP Notice: Trying to access array offset on value of type null - https://phabricator.wikimedia.org/T324119 [19:40:05] RECOVERY - SSH on mw1327.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:41:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [19:42:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance [19:42:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 
(T322618)', diff saved to https://phabricator.wikimedia.org/P41909 and previous config saved to /var/cache/conftool/dbconfig/20221130-194220-ladsgroup.json [19:42:27] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [19:43:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T322618)', diff saved to https://phabricator.wikimedia.org/P41910 and previous config saved to /var/cache/conftool/dbconfig/20221130-194328-ladsgroup.json [19:46:35] Thanks Amir1! [19:47:15] ^^ [19:47:21] PROBLEM - nova-compute proc minimum on cloudvirt-wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:48:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:48:06] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862324 (https://phabricator.wikimedia.org/T320517) [19:48:08] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862324 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) [19:48:50] !log robh@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti5004.eqsin.wmnet with OS bullseye [19:48:50] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.12 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862324 (https://phabricator.wikimedia.org/T320517) (owner: 10TrainBranchBot) [19:48:59] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - 
https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by robh@cumin2002 for host ganeti5004.eqsin.wmnet with OS bullseye [19:50:19] PROBLEM - nova-compute proc minimum on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:51:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:52:28] (03CR) 10Ssingh: [V: 03+1] P:cache::haproxy: harden systemd unit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/861445 (https://phabricator.wikimedia.org/T323944) (owner: 10Ssingh) [19:53:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:55:17] (03PS1) 10AOkoth: vrts: add mac address for vrts2001 [puppet] - 10https://gerrit.wikimedia.org/r/862325 (https://phabricator.wikimedia.org/T323515) [19:55:33] PROBLEM - nova-compute proc maximum on cloudvirt-wdqs1003 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:56:14] !log dancy@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.12 refs T320517 [19:56:22] T320517: 1.40.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T320517 [19:56:45] (03CR) 10Dzahn: [C: 03+1] vrts: add mac address for vrts2001 [puppet] - 10https://gerrit.wikimedia.org/r/862325 (https://phabricator.wikimedia.org/T323515) (owner: 
10AOkoth) [19:57:13] (03CR) 10AOkoth: [C: 03+2] vrts: add mac address for vrts2001 [puppet] - 10https://gerrit.wikimedia.org/r/862325 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [19:58:31] PROBLEM - nova-compute proc maximum on cloudvirt-wdqs1001 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:58:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P41911 and previous config saved to /var/cache/conftool/dbconfig/20221130-195834-ladsgroup.json [20:00:56] PROBLEM - nova-compute proc minimum on cloudvirt-wdqs1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:01:38] !log dancy@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.12 refs T320517 (duration: 05m 23s) [20:01:45] T320517: 1.40.0-wmf.12 deployment blockers - https://phabricator.wikimedia.org/T320517 [20:01:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST pods) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:05:29] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (10BCornwall) @ayounsi I've started a tcpdump on the dns hosts to see what devices are still reaching out. It's on our radar and I intend on addressing the remaining hosts (or poking dcops to do it for u... 
[20:05:29] !log brett@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp5023.eqsin.wmnet with reason: host reimage [20:08:55] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5023.eqsin.wmnet with reason: host reimage [20:13:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P41912 and previous config saved to /var/cache/conftool/dbconfig/20221130-201341-ladsgroup.json [20:14:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [20:14:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [20:14:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2102.codfw.wmnet with reason: Maintenance [20:15:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2102.codfw.wmnet with reason: Maintenance [20:15:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [20:15:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [20:15:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2103 (T322618)', diff saved to https://phabricator.wikimedia.org/P41913 and previous config saved to /var/cache/conftool/dbconfig/20221130-201533-ladsgroup.json [20:15:40] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [20:16:29] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01821 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [20:16:45] RECOVERY - nova-compute proc minimum on cloudvirt-wdqs1003 is 
OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:16:46] RECOVERY - nova-compute proc maximum on cloudvirt-wdqs1003 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:16:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T323907)', diff saved to https://phabricator.wikimedia.org/P41914 and previous config saved to /var/cache/conftool/dbconfig/20221130-201653-ladsgroup.json [20:17:01] T323907: Make fr_user unsigned - https://phabricator.wikimedia.org/T323907 [20:17:20] (03PS2) 10Ssingh: cp5024: update site.pp and related configs for cp role [puppet] - 10https://gerrit.wikimedia.org/r/861912 (https://phabricator.wikimedia.org/T322048) [20:17:23] (03PS1) 10AOkoth: vrts: vrts2001 partman config [puppet] - 10https://gerrit.wikimedia.org/r/862327 (https://phabricator.wikimedia.org/T323515) [20:17:38] !log robh@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti5004.eqsin.wmnet with reason: host reimage [20:17:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T322618)', diff saved to https://phabricator.wikimedia.org/P41915 and previous config saved to /var/cache/conftool/dbconfig/20221130-201743-ladsgroup.json [20:18:10] (03CR) 10Dzahn: [C: 03+1] vrts: vrts2001 partman config [puppet] - 10https://gerrit.wikimedia.org/r/862327 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [20:18:29] (03CR) 10AOkoth: [C: 03+2] vrts: vrts2001 partman config [puppet] - 10https://gerrit.wikimedia.org/r/862327 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [20:18:51] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.0009843 https://puppetboard.wikimedia.org/nodes?status=failed 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [23:19:16] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [23:22:04] !log brett@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp5025.eqsin.wmnet with OS buster [23:22:12] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host cp5025.eqsin.wmnet with OS buster completed: - cp5025 (**WARN**) -... [23:23:14] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [23:24:03] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1206.mgmt.eqiad.wmnet with reboot policy FORCED [23:24:12] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1049 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [23:24:42] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1206'] [23:26:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T322618)', diff saved to https://phabricator.wikimedia.org/P41960 and previous config saved to /var/cache/conftool/dbconfig/20221130-232637-ladsgroup.json [23:26:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2146.codfw.wmnet with reason: Maintenance [23:26:45] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [23:26:48] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (10Papaul) [23:26:52] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db2146.codfw.wmnet with reason: Maintenance [23:26:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2146 (T322618)', diff saved to https://phabricator.wikimedia.org/P41961 and previous config saved to /var/cache/conftool/dbconfig/20221130-232658-ladsgroup.json [23:27:02] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (10Papaul) @Jclark-ctr thanks [23:27:12] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift [23:29:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T322618)', diff saved to https://phabricator.wikimedia.org/P41962 and previous config saved to /var/cache/conftool/dbconfig/20221130-232908-ladsgroup.json [23:29:17] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/862226 (https://phabricator.wikimedia.org/T324089) (owner: 10Filippo Giunchedi) [23:30:06] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest2001'] [23:30:10] (03CR) 10Cwhite: [C: 03+2] install_server: set eqiad bullseye vms to install bullseye [puppet] - 10https://gerrit.wikimedia.org/r/861871 (https://phabricator.wikimedia.org/T321410) (owner: 10Cwhite) [23:30:18] (03PS2) 10Cwhite: install_server: set eqiad bullseye vms to install bullseye [puppet] - 10https://gerrit.wikimedia.org/r/861871 (https://phabricator.wikimedia.org/T321410) [23:30:46] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest2001'] [23:32:40] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1206'] [23:32:54] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1206'] [23:33:15] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118', diff saved to https://phabricator.wikimedia.org/P41963 and previous config saved to /var/cache/conftool/dbconfig/20221130-233314-ladsgroup.json [23:33:21] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1206'] [23:35:56] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1206'] [23:36:12] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db1206'] [23:36:48] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1206'] [23:38:19] 10Puppet, 10SRE, 10Infrastructure-Foundations: Write, publish and deploy puppet-lint plug-in for ensure attribute bareword check - https://phabricator.wikimedia.org/T95377 (10Dzahn) @jbond and all. I wonder what you would think about this now in 2022. Are the barewords (ensure => link,) good and the single... [23:39:08] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908 (10Dzahn) @jbond and all. I wonder what you would think about this now in 2022. Are the barewords (ensure => link,) good and the single quotes bad as... 
[23:40:51] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.082 second response time https://wikitech.wikimedia.org/wiki/Swift [23:43:30] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift [23:44:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P41964 and previous config saved to /var/cache/conftool/dbconfig/20221130-234414-ladsgroup.json [23:44:51] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10serviceops, 10serviceops-collab: Create a puppet define for systemd timers - https://phabricator.wikimedia.org/T111031 (10Dzahn) [23:45:27] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10serviceops, 10serviceops-collab: Create a puppet define for systemd timers - https://phabricator.wikimedia.org/T111031 (10Dzahn) Nowadays in 2022 we have one, it's called `systemd::timer::job`, was created by @joe in https://gerrit.wikimedia.org/r/c/operat... [23:46:05] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10serviceops, 10serviceops-collab: Create a puppet define for systemd timers - https://phabricator.wikimedia.org/T111031 (10Dzahn) 05Open→03Resolved a:03Dzahn closing out old SRE tickets. seems resolved to me. please reopen if you disagree. 
[23:48:10] PROBLEM - Swift https frontend on ms-fe1011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [23:48:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1118 (T322618)', diff saved to https://phabricator.wikimedia.org/P41965 and previous config saved to /var/cache/conftool/dbconfig/20221130-234821-ladsgroup.json [23:48:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [23:48:32] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [23:48:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1119.eqiad.wmnet with reason: Maintenance [23:48:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T322618)', diff saved to https://phabricator.wikimedia.org/P41966 and previous config saved to /var/cache/conftool/dbconfig/20221130-234844-ladsgroup.json [23:48:49] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['db1206'] [23:49:10] 10Puppet, 10SRE, 10Infrastructure-Foundations: Ensure that there are no firewall rules in modules - https://phabricator.wikimedia.org/T114209 (10Dzahn) status nowadays: ` ~/puppet/modules$ grep -r ferm::service * | grep -v profile acme_chief/manifests/server.pp: ferm::service { 'acme-chief-api': acme_c... 
[23:49:36] RECOVERY - Swift https frontend on ms-fe1011 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift [23:49:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T322618)', diff saved to https://phabricator.wikimedia.org/P41967 and previous config saved to /var/cache/conftool/dbconfig/20221130-234952-ladsgroup.json [23:54:24] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db1206.eqiad.wmnet with OS bullseye [23:54:31] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1206 - https://phabricator.wikimedia.org/T322256 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db1206.eqiad.wmnet with OS bullseye [23:59:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P41968 and previous config saved to /var/cache/conftool/dbconfig/20221130-235921-ladsgroup.json