[00:42:23] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:16:55] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:26:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:38:35] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [01:38:53] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:40:31] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [01:48:53] (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:53:53] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:00:17] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:06:19] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:08:53] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:18:53] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:51:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:56:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:08:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:23:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:45:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:00:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:08:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:13:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:19:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:24:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [05:45:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:00:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:02:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2173', diff saved to https://phabricator.wikimedia.org/P39372 and previous config saved to /var/cache/conftool/dbconfig/20221114-060207-root.json [06:07:42] (03PS1) 10Marostegui: db2173: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/856199 (https://phabricator.wikimedia.org/T322987) [06:08:28] 
!log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1191.eqiad.wmnet with reason: Maintenance [06:08:41] (03CR) 10Marostegui: [C: 03+2] db2173: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/856199 (https://phabricator.wikimedia.org/T322987) (owner: 10Marostegui) [06:08:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1191.eqiad.wmnet with reason: Maintenance [06:08:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T321130)', diff saved to https://phabricator.wikimedia.org/P39373 and previous config saved to /var/cache/conftool/dbconfig/20221114-060847-marostegui.json [06:08:51] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [06:09:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:11:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T321130)', diff saved to https://phabricator.wikimedia.org/P39374 and previous config saved to /var/cache/conftool/dbconfig/20221114-061100-marostegui.json [06:14:05] 10ops-codfw: db2173 HW errors - https://phabricator.wikimedia.org/T322988 (10Marostegui) [06:14:20] 10ops-codfw: db2173 HW errors - https://phabricator.wikimedia.org/T322988 (10Marostegui) p:05Triage→03Medium [06:21:27] 10ops-codfw: db2173 HW errors - https://phabricator.wikimedia.org/T322988 (10Marostegui) I am trying to poweron the host but it is not working: ` racadm>>serveraction powerup Server power operation initiated successfully racadm>>serveraction powerstatus Server power status: OFF ` [06:22:22] (03PS1) 10Marostegui: db2094: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/856375 
(https://phabricator.wikimedia.org/T322987) [06:23:17] (03CR) 10Marostegui: [C: 03+2] db2094: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/856375 (https://phabricator.wikimedia.org/T322987) (owner: 10Marostegui) [06:24:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:26:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P39375 and previous config saved to /var/cache/conftool/dbconfig/20221114-062607-marostegui.json [06:29:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:34:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:35:10] ACKNOWLEDGEMENT - MariaDB Replica IO: s1 on db2094 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2173.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2173.codfw.wmnet (110 Connection timed out) Marostegui T322987 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:35:10] ACKNOWLEDGEMENT - MariaDB Replica Lag: s1 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 129264.88 seconds Marostegui T322987 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:37:13] PROBLEM - Check systemd state on ms-be1062 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:41:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P39376 and previous config saved to /var/cache/conftool/dbconfig/20221114-064113-marostegui.json [06:45:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:53:03] RECOVERY - Check systemd state on ms-be1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:56:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T321130)', diff saved to https://phabricator.wikimedia.org/P39377 and previous config saved to /var/cache/conftool/dbconfig/20221114-065620-marostegui.json [06:56:25] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [07:05:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:08:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2118.codfw.wmnet with reason: Maintenance [07:08:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2118.codfw.wmnet with reason: Maintenance [07:13:31] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1181.eqiad.wmnet with reason: Maintenance [07:13:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1181.eqiad.wmnet with reason: Maintenance [07:13:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:21:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1121.eqiad.wmnet with reason: Maintenance [07:21:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1121.eqiad.wmnet with reason: Maintenance [07:21:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:21:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [07:22:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T321130)', diff saved to https://phabricator.wikimedia.org/P39378 and previous config saved to /var/cache/conftool/dbconfig/20221114-072203-marostegui.json [07:22:08] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [07:23:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:25:12] RECOVERY - cassandra-b CQL 10.64.0.213:9042 on aqs1016 is OK: TCP OK - 0.000 second response time on 10.64.0.213 
port 9042 https://phabricator.wikimedia.org/T93886 [07:29:23] (03PS1) 10Slyngshede: C:spamassassin disable email notification on failure. [puppet] - 10https://gerrit.wikimedia.org/r/856472 [07:29:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:34:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:36:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T321130)', diff saved to https://phabricator.wikimedia.org/P39379 and previous config saved to /var/cache/conftool/dbconfig/20221114-073624-marostegui.json [07:36:30] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [07:39:49] (03PS1) 10Marostegui: ProductionServices.php: Promote pc2014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856473 (https://phabricator.wikimedia.org/T322295) [07:41:15] (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc2014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856473 (https://phabricator.wikimedia.org/T322295) (owner: 10Marostegui) [07:41:18] (03PS1) 10Marostegui: pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/856474 (https://phabricator.wikimedia.org/T322295) [07:41:58] (03CR) 10Marostegui: [C: 03+2] pc2011: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/856474 (https://phabricator.wikimedia.org/T322295) (owner: 10Marostegui) [07:42:00] (03Merged) 10jenkins-bot: ProductionServices.php: 
Promote pc2014 to pc1 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856473 (https://phabricator.wikimedia.org/T322295) (owner: 10Marostegui) [07:42:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856473 (https://phabricator.wikimedia.org/T322295) (owner: 10Marostegui) [07:42:25] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:856473|ProductionServices.php: Promote pc2014 to pc1 master (T322295)]] [07:42:33] T322295: Migrate pc1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T322295 [07:42:50] !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:856473|ProductionServices.php: Promote pc2014 to pc1 master (T322295)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [07:43:28] (03PS1) 10Marostegui: pc2011: Migrate from 10.4 to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/856475 (https://phabricator.wikimedia.org/T322295) [07:44:18] (03CR) 10Marostegui: [C: 03+2] pc2011: Migrate from 10.4 to 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/856475 (https://phabricator.wikimedia.org/T322295) (owner: 10Marostegui) [07:45:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:47:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1033.eqiad.wmnet to cluster eqiad and group D [07:47:39] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:856473|ProductionServices.php: Promote pc2014 to pc1 master (T322295)]] (duration: 05m 14s) [07:47:45] T322295: Migrate pc1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T322295 [07:47:48] !log jmm@cumin2002 END 
(FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1033.eqiad.wmnet to cluster eqiad and group D [07:50:44] (03CR) 10Elukey: [C: 03+1] k8s: Stop docker/runc spam from being written to syslog [puppet] - 10https://gerrit.wikimedia.org/r/855969 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [07:50:57] !log draining ganeti1021 for eventual reimage T311687 [07:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:02] T311687: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 [07:51:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P39380 and previous config saved to /var/cache/conftool/dbconfig/20221114-075131-marostegui.json [07:54:29] (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855741 [07:55:07] (03PS1) 10Marostegui: pc2011: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/856476 (https://phabricator.wikimedia.org/T322295) [07:55:48] ACKNOWLEDGEMENT - SSH on db2173 is CRITICAL: CRITICAL - Socket timeout after 10 seconds Marostegui T322987 https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:56:11] (03CR) 10Marostegui: [C: 03+2] pc2011: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/856476 (https://phabricator.wikimedia.org/T322295) (owner: 10Marostegui) [07:56:23] (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc2014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855741 (owner: 10Marostegui) [07:57:05] (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc2014 to pc1 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855741 (owner: 10Marostegui) [07:57:11] (03PS7) 10Slyngshede: icinga: allow wait_for_optimal to ignore ack'ed alerts. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) [07:57:20] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:57:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by marostegui@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855741 (owner: 10Marostegui) [07:57:36] !log marostegui@deploy1002 Started scap: Backport for [[gerrit:855741|Revert "ProductionServices.php: Promote pc2014 to pc1 master"]] [07:57:57] !log marostegui@deploy1002 marostegui and marostegui: Backport for [[gerrit:855741|Revert "ProductionServices.php: Promote pc2014 to pc1 master"]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [08:00:04] (03CR) 10Elukey: [C: 04-1] "I checked all the cidrs and the ML ones are wrong afaics :( I think that we never cleaned up the old /23-/24 allocations when we moved to " [puppet] - 10https://gerrit.wikimedia.org/r/855997 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [08:00:05] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221114T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:00:22] (03CR) 10Elukey: [C: 03+1] "Checked all the cidrs, looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/855999 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [08:00:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:00:59] (03CR) 10Elukey: [C: 03+2] Add a basic puppetization for Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854487 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [08:01:38] (03PS21) 10Elukey: centrallog: add first prototype of webrequest-live with Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) [08:02:11] !log marostegui@deploy1002 Finished scap: Backport for [[gerrit:855741|Revert "ProductionServices.php: Promote pc2014 to pc1 master"]] (duration: 04m 34s) [08:06:06] (03CR) 10Jelto: [C: 03+1] "lgtm now" [puppet] - 10https://gerrit.wikimedia.org/r/852831 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:06:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P39381 and previous config saved to /var/cache/conftool/dbconfig/20221114-080637-marostegui.json [08:17:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:20:46] !log installing php7.4 security updates (as packaged in Debian) [08:20:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T321130)', diff saved to 
https://phabricator.wikimedia.org/P39383 and previous config saved to /var/cache/conftool/dbconfig/20221114-082144-marostegui.json [08:21:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1141.eqiad.wmnet with reason: Maintenance [08:21:49] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [08:21:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1141.eqiad.wmnet with reason: Maintenance [08:22:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T321130)', diff saved to https://phabricator.wikimedia.org/P39384 and previous config saved to /var/cache/conftool/dbconfig/20221114-082205-marostegui.json [08:22:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:24:36] (03PS1) 10Marostegui: db2145: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/856477 [08:24:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2145 T322620', diff saved to https://phabricator.wikimedia.org/P39385 and previous config saved to /var/cache/conftool/dbconfig/20221114-082458-root.json [08:25:03] T322620: Compile and package MariaDB 10.4.27 and 10.6.11 - https://phabricator.wikimedia.org/T322620 [08:25:32] (03CR) 10Marostegui: [C: 03+2] db2145: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/856477 (owner: 10Marostegui) [08:27:43] (03CR) 10Jelto: "left a small note in-line about variable names" [puppet] - 10https://gerrit.wikimedia.org/r/856013 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [08:28:40] RECOVERY - DPKG on netmon1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [08:33:53] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T321130)', diff saved to https://phabricator.wikimedia.org/P39386 and previous config saved to /var/cache/conftool/dbconfig/20221114-083352-marostegui.json [08:33:58] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [08:34:17] (03PS1) 10Marostegui: Revert "db2145: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/855742 [08:35:39] (03CR) 10Marostegui: [C: 03+2] Revert "db2145: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/855742 (owner: 10Marostegui) [08:36:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2145 (re)pooling @ 1%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39387 and previous config saved to /var/cache/conftool/dbconfig/20221114-083620-root.json [08:48:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P39388 and previous config saved to /var/cache/conftool/dbconfig/20221114-084859-marostegui.json [08:49:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:51:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2145 (re)pooling @ 3%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39389 and previous config saved to /var/cache/conftool/dbconfig/20221114-085125-root.json [08:58:51] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (10fgiunchedi) >>! In T322147#8389743, @ILooremeta-WMF wrote: > @fgiunchedi what would the email read like, please? I think I might have lost it in the m... 
[08:59:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:04:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P39391 and previous config saved to /var/cache/conftool/dbconfig/20221114-090406-marostegui.json [09:06:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2145 (re)pooling @ 5%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39392 and previous config saved to /var/cache/conftool/dbconfig/20221114-090630-root.json [09:07:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:12:11] (03CR) 10Filippo Giunchedi: "Thank you! I've run PCC on the hosts that run profile::mariadb::ferm_misc and they fail:" [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [09:12:26] (03CR) 10Filippo Giunchedi: [C: 04-1] "-1 because change will fail, LGTM otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [09:12:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:14:38] (03PS8) 10Slyngshede: icinga: allow wait_for_optimal to ignore ack'ed alerts. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) [09:16:32] (03CR) 10Slyngshede: Initial checkin. User and Group classes for interacting with LDAP. (0336 comments) [debs/python-wmf-ldap] - 10https://gerrit.wikimedia.org/r/820601 (https://phabricator.wikimedia.org/T313595) (owner: 10Slyngshede) [09:16:41] (03CR) 10Filippo Giunchedi: [C: 03+1] netmon: Add netmon2002 to the alertmanager rw api (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/854974 (https://phabricator.wikimedia.org/T315523) (owner: 10Andrea Denisse) [09:16:57] (03CR) 10Slyngshede: [C: 03+2] C:idm::deployment logrotation for Django logs. [puppet] - 10https://gerrit.wikimedia.org/r/853283 (https://phabricator.wikimedia.org/T320431) (owner: 10Slyngshede) [09:18:39] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 13335 [09:18:45] !log ayounsi@cumin1001 END (ERROR) - Cookbook sre.network.peering (exit_code=97) with action 'configure' for AS: 13335 [09:19:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T321130)', diff saved to https://phabricator.wikimedia.org/P39393 and previous config saved to /var/cache/conftool/dbconfig/20221114-091912-marostegui.json [09:19:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1142.eqiad.wmnet with reason: Maintenance [09:19:17] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [09:19:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1142.eqiad.wmnet with reason: Maintenance [09:19:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T321130)', diff saved to https://phabricator.wikimedia.org/P39394 and previous config saved to /var/cache/conftool/dbconfig/20221114-091934-marostegui.json [09:20:05] (03PS9) 10Slyngshede: icinga: allow wait_for_optimal to ignore ack'ed 
alerts. [software/spicerack] - https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) [09:21:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2145 (re)pooling @ 10%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39395 and previous config saved to /var/cache/conftool/dbconfig/20221114-092135-root.json [09:22:19] PROBLEM - SSH on mw1330.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:23:48] (PS1) Muehlenhoff: Sync to 6.6.2 of the CAS overlay [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/856480 [09:24:39] (CR) Muehlenhoff: Sync to 6.6.2 of the CAS overlay (1 comment) [software/cas-overlay-template] - https://gerrit.wikimedia.org/r/854998 (owner: Muehlenhoff) [09:24:45] (CR) Filippo Giunchedi: "I completely missed the bogus syntax back when I sent the additional columns, fixed in this review" [puppet] - https://gerrit.wikimedia.org/r/856478 (owner: Filippo Giunchedi) [09:25:47] (CR) Slyngshede: icinga: allow wait_for_optimal to ignore ack'ed alerts.
(10 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) (owner: Slyngshede) [09:28:37] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38126/console" [puppet] - https://gerrit.wikimedia.org/r/853276 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi) [09:30:15] (CR) Ladsgroup: [C: +1] add_cul_actor_T321126.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/855959 (https://phabricator.wikimedia.org/T321126) (owner: Marostegui) [09:30:33] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38127/console" [puppet] - https://gerrit.wikimedia.org/r/853277 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi) [09:31:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T321130)', diff saved to https://phabricator.wikimedia.org/P39396 and previous config saved to /var/cache/conftool/dbconfig/20221114-093118-marostegui.json [09:31:23] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [09:31:52] SRE, Wikimedia-Mailing-lists: Create a mailing list for Deoband Community Wikimedia - https://phabricator.wikimedia.org/T322996 (Ladsgroup) do you want it public or private? [09:33:17] (PS1) Ladsgroup: Rework SpecialPagesWithoutScans query [extensions/ProofreadPage] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/855743 (https://phabricator.wikimedia.org/T322849) [09:33:51] (CR) Volans: [C: -1] "This doesn't seem to be needed." 
[dns] - https://gerrit.wikimedia.org/r/856065 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse) [09:34:17] jouncebot: nowandnext [09:34:17] No deployments scheduled for the next 4 hour(s) and 25 minute(s) [09:34:17] In 4 hour(s) and 25 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221114T1400) [09:34:24] Awesome [09:34:32] (CR) Ladsgroup: [C: +2] Rework SpecialPagesWithoutScans query [extensions/ProofreadPage] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/855743 (https://phabricator.wikimedia.org/T322849) (owner: Ladsgroup) [09:34:53] (PS2) Muehlenhoff: postgres: Remove support for Stretch [puppet] - https://gerrit.wikimedia.org/r/812230 [09:35:52] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 32934 [09:35:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:36:40] SRE, Wikimedia-Mailing-lists: Create a mailing list for Deoband Community Wikimedia - https://phabricator.wikimedia.org/T322996 (TheAafi) >>! In T322996#8391475, @Ladsgroup wrote: > do you want it public or private? Public. 
[09:36:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2145 (re)pooling @ 25%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39397 and previous config saved to /var/cache/conftool/dbconfig/20221114-093640-root.json [09:37:50] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 32934 [09:38:04] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 50083 [09:38:18] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 50083 [09:38:42] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4788 [09:39:04] SRE, SRE-Access-Requests, User-Ryasmeen: Requesting access to analytics-privatedata-users for ryasmeen (superset access with no server access) - https://phabricator.wikimedia.org/T322795 (Volans) @fgiunchedi the above patch is missing the `ssh_keys` key and that broke the `/usr/local/bin/cross-valida... [09:39:19] SRE, Wikimedia-Mailing-lists: Create a mailing list for Deoband Community Wikimedia - https://phabricator.wikimedia.org/T322996 (Ladsgroup) Open→Resolved a:Ladsgroup Done: https://lists.wikimedia.org/postorius/lists/wikimedia-dcw.lists.wikimedia.org [09:39:23] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/812230 (owner: Muehlenhoff) [09:39:29] can someone with op here remove me from clinic duty and put jynus instead? 
thank you [09:39:41] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4788 [09:39:50] godog: I can do that, sorry [09:40:07] I got distracted by other stuff [09:40:24] sure np, thanks jynus [09:40:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:40:58] SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup: Q2:rack/setup/install dbprov1004 - https://phabricator.wikimedia.org/T321122 (cmooney) @Jclark-ctr FYI I pushed the config for this port to the switch with Homer now. [09:44:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti1033.eqiad.wmnet to cluster eqiad and group D [09:46:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P39398 and previous config saved to /var/cache/conftool/dbconfig/20221114-094624-marostegui.json [09:47:40] (PS1) Vgutierrez: sslcert::update-ocsp: Stop using SafeConfigParser [puppet] - https://gerrit.wikimedia.org/r/856483 (https://phabricator.wikimedia.org/T321309) [09:51:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2145 (re)pooling @ 50%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39399 and previous config saved to /var/cache/conftool/dbconfig/20221114-095145-root.json [09:52:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:55:18] (PS1) Filippo Giunchedi: admin: add ssh_keys for ryasmeen [puppet] - 
https://gerrit.wikimedia.org/r/856484 (https://phabricator.wikimedia.org/T322795) [09:55:20] (PS1) Filippo Giunchedi: admin: validate human users have ssh_keys [puppet] - https://gerrit.wikimedia.org/r/856485 (https://phabricator.wikimedia.org/T322795) [09:55:31] SRE, SRE-Access-Requests, Patch-For-Review, User-Ryasmeen: Requesting access to analytics-privatedata-users for ryasmeen (superset access with no server access) - https://phabricator.wikimedia.org/T322795 (fgiunchedi) Interesting, thank you for the heads up @Volans . There's an obvious disconnect... [09:55:45] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for David.pujol - https://phabricator.wikimedia.org/T322670 (jcrespo) [09:55:55] (Merged) jenkins-bot: Rework SpecialPagesWithoutScans query [extensions/ProofreadPage] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/855743 (https://phabricator.wikimedia.org/T322849) (owner: Ladsgroup) [09:55:56] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for David.pujol - https://phabricator.wikimedia.org/T322670 (jcrespo) [09:56:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance [09:56:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2098.codfw.wmnet with reason: Maintenance [09:56:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2100.codfw.wmnet with reason: Maintenance [09:56:35] !log ladsgroup@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 6:00:00 on db2100.codfw.wmnet with reason: Maintenance [09:57:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [09:57:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2097.codfw.wmnet with 
reason: Maintenance [09:57:28] (PS2) JMeybohm: k8s: make profile::kubernetes::cluster_cidr mandatory [puppet] - https://gerrit.wikimedia.org/r/855997 (https://phabricator.wikimedia.org/T307943) [09:57:30] (PS2) JMeybohm: k8s: Refactor profile::kubernetes::master::service_cluster_ip_range [puppet] - https://gerrit.wikimedia.org/r/855999 (https://phabricator.wikimedia.org/T307943) [09:57:35] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/ProofreadPage] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/855743 (https://phabricator.wikimedia.org/T322849) (owner: Ladsgroup) [09:57:46] SRE, SRE-Access-Requests, Patch-For-Review, User-Ryasmeen: Requesting access to analytics-privatedata-users for ryasmeen (superset access with no server access) - https://phabricator.wikimedia.org/T322795 (Volans) I'm ok either way, probably I'd go the CI way too, but check with moritz/john on th... [09:57:47] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:855743|Rework SpecialPagesWithoutScans query (T322849)]] [09:57:51] T322849: Proofread's SpecialPagesWithoutScans makes extremely slow queries - https://phabricator.wikimedia.org/T322849 [09:57:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [09:58:08] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:855743|Rework SpecialPagesWithoutScans query (T322849)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [09:58:16] (CR) JMeybohm: k8s: make profile::kubernetes::cluster_cidr mandatory (4 comments) [puppet] - https://gerrit.wikimedia.org/r/855997 
(https://phabricator.wikimedia.org/T307943) (owner: JMeybohm) [09:59:30] (CR) JMeybohm: [C: +1] Enable profile::auto_restarts::service for jwt-authorizer on docker registry [puppet] - https://gerrit.wikimedia.org/r/852831 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [10:01:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P39400 and previous config saved to /var/cache/conftool/dbconfig/20221114-100131-marostegui.json [10:02:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [10:02:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance [10:02:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [10:02:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2104.codfw.wmnet with reason: Maintenance [10:02:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T322618)', diff saved to https://phabricator.wikimedia.org/P39401 and previous config saved to /var/cache/conftool/dbconfig/20221114-100254-ladsgroup.json [10:02:59] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [10:05:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T322618)', diff saved to https://phabricator.wikimedia.org/P39402 and previous config saved to /var/cache/conftool/dbconfig/20221114-100515-ladsgroup.json [10:05:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:06:26] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (jcrespo) While we wait for Jcross to return to office, @Ottomata it would help if you could approve on your side. Access to the above group until June 2023. [10:06:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [10:06:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2145 (re)pooling @ 75%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39403 and previous config saved to /var/cache/conftool/dbconfig/20221114-100650-root.json [10:06:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [10:06:55] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (jcrespo) [10:07:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [10:07:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1105.eqiad.wmnet with reason: Maintenance [10:07:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P39404 and previous config saved to /var/cache/conftool/dbconfig/20221114-100720-ladsgroup.json [10:07:29] !log upload acme-chief 0.35 to apt.wm.o (buster-wikimedia) - T244232 T262251 [10:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:34] T262251: acme-chief shouldn't try to perform OCSP stapling of expired certs - https://phabricator.wikimedia.org/T262251 [10:07:35] T244232: acme-chief should be able to refresh OCSP stapling response even if the renewal process fails - 
https://phabricator.wikimedia.org/T244232 [10:09:04] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:855743|Rework SpecialPagesWithoutScans query (T322849)]] (duration: 11m 17s) [10:09:09] T322849: Proofread's SpecialPagesWithoutScans makes extremely slow queries - https://phabricator.wikimedia.org/T322849 [10:09:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P39405 and previous config saved to /var/cache/conftool/dbconfig/20221114-100931-ladsgroup.json [10:09:36] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [10:09:50] SRE, Infrastructure-Foundations: IDM: Central logging on all changes - https://phabricator.wikimedia.org/T320431 (SLyngshede-WMF) In progress→Resolved [10:09:52] SRE, Infrastructure-Foundations: IDM milestone 1 "Initial development work" - https://phabricator.wikimedia.org/T319407 (SLyngshede-WMF) [10:11:28] (PS1) Slyngshede: P:idm fix service fqdn variable naming. [puppet] - https://gerrit.wikimedia.org/r/856509 [10:12:09] !log upgrading acme-chief on acmechief1001 to version 0.35 (requires disabling puppet on R:acme_chief::cert) [10:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:03] taavi: ^^ that release ships your changes to support Puppet 7 [10:13:54] (CR) Giuseppe Lavagetto: [C: +2] modules/base.kubernetes: add module [deployment-charts] - https://gerrit.wikimedia.org/r/855667 (owner: Giuseppe Lavagetto) [10:14:21] (CR) Slyngshede: [C: +2] P:idm fix service fqdn variable naming. 
[puppet] - https://gerrit.wikimedia.org/r/856509 (owner: Slyngshede) [10:15:36] (CR) Muehlenhoff: "Getting close, another round of comments" [software/bitu-ldap] - https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) (owner: Slyngshede) [10:15:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:16:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T321130)', diff saved to https://phabricator.wikimedia.org/P39406 and previous config saved to /var/cache/conftool/dbconfig/20221114-101637-marostegui.json [10:16:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1143.eqiad.wmnet with reason: Maintenance [10:16:42] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [10:16:50] (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for jwt-authorizer on docker registry (1 comment) [puppet] - https://gerrit.wikimedia.org/r/852831 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [10:16:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1143.eqiad.wmnet with reason: Maintenance [10:16:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T321130)', diff saved to https://phabricator.wikimedia.org/P39407 and previous config saved to /var/cache/conftool/dbconfig/20221114-101659-marostegui.json [10:19:46] (Merged) jenkins-bot: modules/base.kubernetes: add module [deployment-charts] - https://gerrit.wikimedia.org/r/855667 (owner: Giuseppe Lavagetto) [10:20:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved 
to https://phabricator.wikimedia.org/P39408 and previous config saved to /var/cache/conftool/dbconfig/20221114-102021-ladsgroup.json [10:21:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2145 (re)pooling @ 100%: After upgrade', diff saved to https://phabricator.wikimedia.org/P39409 and previous config saved to /var/cache/conftool/dbconfig/20221114-102155-root.json [10:22:37] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/856485 (https://phabricator.wikimedia.org/T322795) (owner: Filippo Giunchedi) [10:23:02] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/856484 (https://phabricator.wikimedia.org/T322795) (owner: Filippo Giunchedi) [10:23:09] RECOVERY - SSH on mw1330.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:24:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P39410 and previous config saved to /var/cache/conftool/dbconfig/20221114-102437-ladsgroup.json [10:27:58] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/856478 (owner: Filippo Giunchedi) [10:28:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T321130)', diff saved to https://phabricator.wikimedia.org/P39411 and previous config saved to /var/cache/conftool/dbconfig/20221114-102853-marostegui.json [10:28:58] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [10:29:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:33:25] (CR) Muehlenhoff: [C: +1] "LGTM" 
[puppet] - https://gerrit.wikimedia.org/r/856472 (owner: Slyngshede) [10:33:50] (PS1) Slyngshede: P:idm add missing directories for static and media content. [puppet] - https://gerrit.wikimedia.org/r/856513 [10:34:11] (CR) Slyngshede: [C: +2] C:spamassassin disable email notification on failure. [puppet] - https://gerrit.wikimedia.org/r/856472 (owner: Slyngshede) [10:34:36] (CR) Marostegui: [C: +2] add_cul_actor_T321126.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/855959 (https://phabricator.wikimedia.org/T321126) (owner: Marostegui) [10:34:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:35:21] (Merged) jenkins-bot: add_cul_actor_T321126.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/855959 (https://phabricator.wikimedia.org/T321126) (owner: Marostegui) [10:35:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P39412 and previous config saved to /var/cache/conftool/dbconfig/20221114-103528-ladsgroup.json [10:36:44] (CR) Slyngshede: [C: +2] P:idm add missing directories for static and media content. 
[puppet] - https://gerrit.wikimedia.org/r/856513 (owner: Slyngshede) [10:39:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1096.eqiad.wmnet with reason: Maintenance [10:39:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1096.eqiad.wmnet with reason: Maintenance [10:39:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P39413 and previous config saved to /var/cache/conftool/dbconfig/20221114-103944-ladsgroup.json [10:39:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T321126)', diff saved to https://phabricator.wikimedia.org/P39414 and previous config saved to /var/cache/conftool/dbconfig/20221114-103953-marostegui.json [10:39:58] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [10:41:49] (PS1) Muehlenhoff: Fix up comments [puppet] - https://gerrit.wikimedia.org/r/856514 (https://phabricator.wikimedia.org/T273673) [10:42:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T321126)', diff saved to https://phabricator.wikimedia.org/P39415 and previous config saved to /var/cache/conftool/dbconfig/20221114-104209-marostegui.json [10:44:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P39416 and previous config saved to /var/cache/conftool/dbconfig/20221114-104400-marostegui.json [10:44:53] SRE, Release Pipeline, Maps (Kartotherian): Make jobprocessor's test not depend on external files - https://phabricator.wikimedia.org/T231009 (jijiki) Open→Invalid After discussing @Jgiannelos, we can mark this as invalid as it is quite old [10:44:56] SRE, Release Pipeline, Maps (Kartotherian): Create blubberfile for deploying kartotherian into 
docker environment. - https://phabricator.wikimedia.org/T223275 (jijiki) [10:46:06] (CR) Filippo Giunchedi: [C: +2] admin: validate human users have ssh_keys [puppet] - https://gerrit.wikimedia.org/r/856485 (https://phabricator.wikimedia.org/T322795) (owner: Filippo Giunchedi) [10:46:09] (CR) Filippo Giunchedi: [C: +2] admin: add ssh_keys for ryasmeen [puppet] - https://gerrit.wikimedia.org/r/856484 (https://phabricator.wikimedia.org/T322795) (owner: Filippo Giunchedi) [10:46:20] (PS2) Filippo Giunchedi: admin: add ssh_keys for ryasmeen [puppet] - https://gerrit.wikimedia.org/r/856484 (https://phabricator.wikimedia.org/T322795) [10:47:01] (CR) Hnowlan: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/812230 (owner: Muehlenhoff) [10:47:37] (PS2) Filippo Giunchedi: admin: validate human users have ssh_keys [puppet] - https://gerrit.wikimedia.org/r/856485 (https://phabricator.wikimedia.org/T322795) [10:47:43] (CR) Filippo Giunchedi: [V: +2] admin: validate human users have ssh_keys [puppet] - https://gerrit.wikimedia.org/r/856485 (https://phabricator.wikimedia.org/T322795) (owner: Filippo Giunchedi) [10:48:01] (PS1) Hnowlan: result_storage.swift: re-enable logging of statements during swift requests [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/856515 (https://phabricator.wikimedia.org/T233196) [10:48:07] (CR) Filippo Giunchedi: [C: +2] cfssl: fix sqlite initdb [puppet] - https://gerrit.wikimedia.org/r/856478 (owner: Filippo Giunchedi) [10:49:23] (CR) Filippo Giunchedi: "This change is ready for review." 
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/853278 (https://phabricator.wikimedia.org/T319163) (owner: Filippo Giunchedi) [10:50:14] (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/856514 (https://phabricator.wikimedia.org/T273673) (owner: Muehlenhoff) [10:50:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T322618)', diff saved to https://phabricator.wikimedia.org/P39417 and previous config saved to /var/cache/conftool/dbconfig/20221114-105034-ladsgroup.json [10:50:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance [10:50:39] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [10:50:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance [10:50:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T322618)', diff saved to https://phabricator.wikimedia.org/P39418 and previous config saved to /var/cache/conftool/dbconfig/20221114-105056-ladsgroup.json [10:51:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:53:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T322618)', diff saved to https://phabricator.wikimedia.org/P39419 and previous config saved to /var/cache/conftool/dbconfig/20221114-105317-ladsgroup.json [10:54:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P39420 and previous config saved to 
/var/cache/conftool/dbconfig/20221114-105450-ladsgroup.json [10:54:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [10:55:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1129.eqiad.wmnet with reason: Maintenance [10:55:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T322618)', diff saved to https://phabricator.wikimedia.org/P39421 and previous config saved to /var/cache/conftool/dbconfig/20221114-105512-ladsgroup.json [10:55:18] (PS6) Filippo Giunchedi: cfssl: introduce cfssl::csr [puppet] - https://gerrit.wikimedia.org/r/853276 (https://phabricator.wikimedia.org/T319163) [10:55:20] (PS6) Filippo Giunchedi: pki: generate CSR for root_ca [puppet] - https://gerrit.wikimedia.org/r/853277 (https://phabricator.wikimedia.org/T319163) [10:55:22] (PS6) Filippo Giunchedi: pki: run initca for root_ca when needed [puppet] - https://gerrit.wikimedia.org/r/853278 (https://phabricator.wikimedia.org/T319163) [10:57:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P39422 and previous config saved to /var/cache/conftool/dbconfig/20221114-105716-marostegui.json [10:57:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T322618)', diff saved to https://phabricator.wikimedia.org/P39423 and previous config saved to /var/cache/conftool/dbconfig/20221114-105723-ladsgroup.json [10:57:28] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [10:59:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P39424 and previous config saved to /var/cache/conftool/dbconfig/20221114-105906-marostegui.json [10:59:12] sorry for the incoming gerrit spam 
[10:59:22] (or not) [11:01:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:07:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:08:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P39425 and previous config saved to /var/cache/conftool/dbconfig/20221114-110824-ladsgroup.json [11:09:15] (PS1) Clément Goubert: mw-web: raise ns resource quota [deployment-charts] - https://gerrit.wikimedia.org/r/856516 [11:09:26] (PS2) Jbond: P:netbox::host: create a motd for the status [puppet] - https://gerrit.wikimedia.org/r/849508 (https://phabricator.wikimedia.org/T320696) [11:10:09] (PS3) Jbond: P:netbox::host: create a motd for the status [puppet] - https://gerrit.wikimedia.org/r/849508 (https://phabricator.wikimedia.org/T320696) [11:10:29] (PS4) Jbond: P:netbox::host: create a motd for the status [puppet] - https://gerrit.wikimedia.org/r/849508 (https://phabricator.wikimedia.org/T320696) [11:12:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P39426 and previous config saved to /var/cache/conftool/dbconfig/20221114-111222-marostegui.json [11:12:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P39427 and previous config saved to 
/var/cache/conftool/dbconfig/20221114-111229-ladsgroup.json [11:12:43] (CR) Jbond: "we should ensure the following is merged before merging this" [puppet] - https://gerrit.wikimedia.org/r/849508 (https://phabricator.wikimedia.org/T320696) (owner: Jbond) [11:12:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:13:49] (PS3) Giuseppe Lavagetto: Add rake task to perform basic conversions [deployment-charts] - https://gerrit.wikimedia.org/r/855668 [11:13:51] (PS1) Giuseppe Lavagetto: Add rake task to convert deployments [deployment-charts] - https://gerrit.wikimedia.org/r/856517 [11:14:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T321130)', diff saved to https://phabricator.wikimedia.org/P39428 and previous config saved to /var/cache/conftool/dbconfig/20221114-111412-marostegui.json [11:14:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1144.eqiad.wmnet with reason: Maintenance [11:14:18] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [11:14:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1144.eqiad.wmnet with reason: Maintenance [11:14:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T321130)', diff saved to https://phabricator.wikimedia.org/P39429 and previous config saved to /var/cache/conftool/dbconfig/20221114-111434-marostegui.json [11:15:26] (PS1) Jcrespo: cross-validate-users: Fix exception when user has not ssh key defined [puppet] - https://gerrit.wikimedia.org/r/856518 [11:16:07] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/849114
(owner: Ayounsi) [11:17:10] jynus: FYI I've added validation to CI that ssh_keys exists instead [11:18:05] (PS3) Vgutierrez: acme-chief: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/829122 (owner: Muehlenhoff) [11:18:22] (CR) Muehlenhoff: [C: +2] Fix up comments [puppet] - https://gerrit.wikimedia.org/r/856514 (https://phabricator.wikimedia.org/T273673) (owner: Muehlenhoff) [11:19:53] (PS2) Ladsgroup: Re-add s11 in db config reload callback [mediawiki-config] - https://gerrit.wikimedia.org/r/854509 (https://phabricator.wikimedia.org/T322598) [11:19:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:20:17] jouncebot: nowandnext [11:20:17] No deployments scheduled for the next 2 hour(s) and 39 minute(s) [11:20:17] In 2 hour(s) and 39 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221114T1400) [11:20:20] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/856483 (https://phabricator.wikimedia.org/T321309) (owner: Vgutierrez) [11:20:30] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38128/console" [puppet] - https://gerrit.wikimedia.org/r/829122 (owner: Muehlenhoff) [11:20:32] godog, volans Ah, I see [11:20:42] I thought I had created the problem [11:20:54] (CR) Vgutierrez: [C: +2] sslcert::update-ocsp: Stop using SafeConfigParser [puppet] - https://gerrit.wikimedia.org/r/856483 (https://phabricator.wikimedia.org/T321309) (owner: Vgutierrez) [11:20:56] and sending a naive fix [11:21:15] I didn't know about T322795#8391495 at all [11:21:34] hehe yeah I did introduce the problem last week
[11:21:49] so now my patch is unneeded [11:22:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:22:07] I think so yeah [11:22:44] I had no context of that [11:22:47] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/854509 (https://phabricator.wikimedia.org/T322598) (owner: Ladsgroup) [11:23:17] (PS5) Filippo Giunchedi: dispatch: sync user role and info from LDAP [puppet] - https://gerrit.wikimedia.org/r/852992 (https://phabricator.wikimedia.org/T313229) [11:23:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P39430 and previous config saved to /var/cache/conftool/dbconfig/20221114-112330-ladsgroup.json [11:23:45] (Abandoned) Jcrespo: cross-validate-users: Fix exception when user has not ssh key defined [puppet] - https://gerrit.wikimedia.org/r/856518 (owner: Jcrespo) [11:23:59] (Merged) jenkins-bot: Re-add s11 in db config reload callback [mediawiki-config] - https://gerrit.wikimedia.org/r/854509 (https://phabricator.wikimedia.org/T322598) (owner: Ladsgroup) [11:24:14] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:854509|Re-add s11 in db config reload callback (T322598)]] [11:24:18] T322598: LBFactoryMulti.php: PHP Notice: Undefined index: s11 - https://phabricator.wikimedia.org/T322598 [11:24:34] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:854509|Re-add s11 in db config reload callback (T322598)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [11:24:58]
(KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:25:12] (CR) Vgutierrez: [V: +1 C: +2] acme-chief: Switch to systemd::sysuser [puppet] - https://gerrit.wikimedia.org/r/829122 (owner: Muehlenhoff) [11:26:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T321130)', diff saved to https://phabricator.wikimedia.org/P39431 and previous config saved to /var/cache/conftool/dbconfig/20221114-112643-marostegui.json [11:26:48] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [11:26:50] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for David.pujol - https://phabricator.wikimedia.org/T322670 (jcrespo) [11:27:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:27:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T321126)', diff saved to https://phabricator.wikimedia.org/P39432 and previous config saved to /var/cache/conftool/dbconfig/20221114-112729-marostegui.json [11:27:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1098.eqiad.wmnet with reason: Maintenance [11:27:34] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [11:27:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to
https://phabricator.wikimedia.org/P39433 and previous config saved to /var/cache/conftool/dbconfig/20221114-112736-ladsgroup.json [11:27:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1098.eqiad.wmnet with reason: Maintenance [11:27:51] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (jcrespo) [11:27:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T321126)', diff saved to https://phabricator.wikimedia.org/P39434 and previous config saved to /var/cache/conftool/dbconfig/20221114-112750-marostegui.json [11:28:41] (PS2) Clément Goubert: mw-web: raise ns resource quota [deployment-charts] - https://gerrit.wikimedia.org/r/856516 [11:29:15] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:854509|Re-add s11 in db config reload callback (T322598)]] (duration: 05m 01s) [11:29:21] T322598: LBFactoryMulti.php: PHP Notice: Undefined index: s11 - https://phabricator.wikimedia.org/T322598 [11:30:03] (CR) Volans: [C: +1] "LGTM if PCC is happy" [puppet] - https://gerrit.wikimedia.org/r/849508 (https://phabricator.wikimedia.org/T320696) (owner: Jbond) [11:30:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T321126)', diff saved to https://phabricator.wikimedia.org/P39435 and previous config saved to /var/cache/conftool/dbconfig/20221114-113006-marostegui.json [11:32:06] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 37271 [11:33:52] (CR) Jbond: json-webrequests-stats: add -t/--time-range (1 comment) [puppet] - https://gerrit.wikimedia.org/r/854521 (owner: Volans) [11:34:26] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 37271 [11:38:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125
(T322618)', diff saved to https://phabricator.wikimedia.org/P39436 and previous config saved to /var/cache/conftool/dbconfig/20221114-113837-ladsgroup.json [11:38:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance [11:38:41] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [11:38:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance [11:38:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [11:39:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2095.codfw.wmnet with reason: Maintenance [11:39:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T322618)', diff saved to https://phabricator.wikimedia.org/P39437 and previous config saved to /var/cache/conftool/dbconfig/20221114-113913-ladsgroup.json [11:40:16] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 9231 [11:40:45] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 9231 [11:41:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T322618)', diff saved to https://phabricator.wikimedia.org/P39438 and previous config saved to /var/cache/conftool/dbconfig/20221114-114134-ladsgroup.json [11:41:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P39439 and previous config saved to /var/cache/conftool/dbconfig/20221114-114150-marostegui.json [11:42:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T322618)', diff saved to https://phabricator.wikimedia.org/P39440 and previous config saved to 
/var/cache/conftool/dbconfig/20221114-114244-ladsgroup.json [11:42:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [11:43:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [11:43:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [11:43:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [11:43:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P39441 and previous config saved to /var/cache/conftool/dbconfig/20221114-114326-ladsgroup.json [11:44:36] (CR) Hnowlan: [C: +2] result_storage.swift: re-enable logging of statements during swift requests [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/856515 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan) [11:45:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P39442 and previous config saved to /var/cache/conftool/dbconfig/20221114-114512-marostegui.json [11:45:16] (PS1) Muehlenhoff: Remove puppet leftovers of old ELK5 hosts [puppet] - https://gerrit.wikimedia.org/r/856521 (https://phabricator.wikimedia.org/T281266) [11:45:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P39443 and previous config saved to /var/cache/conftool/dbconfig/20221114-114537-ladsgroup.json [11:45:48] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [11:47:51] (PS1) Muehlenhoff: Remove leftover entry for bast4002 [puppet] -
https://gerrit.wikimedia.org/r/856522 [11:49:37] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for David.pujol - https://phabricator.wikimedia.org/T322670 (jcrespo) Open→In progress [11:49:44] (Merged) jenkins-bot: result_storage.swift: re-enable logging of statements during swift requests [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/856515 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan) [11:49:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [11:50:07] (PS1) Muehlenhoff: Remove leftover entry for centrallog2001 [puppet] - https://gerrit.wikimedia.org/r/856523 (https://phabricator.wikimedia.org/T298994) [11:54:34] (PS1) Muehlenhoff: Remove leftover Puppet entry [puppet] - https://gerrit.wikimedia.org/r/856524 (https://phabricator.wikimedia.org/T292075) [11:56:13] (CR) Clément Goubert: [C: +2] mw-web: raise ns resource quota [deployment-charts] - https://gerrit.wikimedia.org/r/856516 (owner: Clément Goubert) [11:56:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P39444 and previous config saved to /var/cache/conftool/dbconfig/20221114-115641-ladsgroup.json [11:56:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P39445 and previous config saved to /var/cache/conftool/dbconfig/20221114-115656-marostegui.json [11:59:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues -
https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:00:04] (PS1) Muehlenhoff: Remove puppet leftovers [puppet] - https://gerrit.wikimedia.org/r/856525 (https://phabricator.wikimedia.org/T306840) [12:00:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P39446 and previous config saved to /var/cache/conftool/dbconfig/20221114-120019-marostegui.json [12:00:39] (Merged) jenkins-bot: mw-web: raise ns resource quota [deployment-charts] - https://gerrit.wikimedia.org/r/856516 (owner: Clément Goubert) [12:00:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P39447 and previous config saved to /var/cache/conftool/dbconfig/20221114-120043-ladsgroup.json [12:01:28] (PS1) Hnowlan: thumbor: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/856526 (https://phabricator.wikimedia.org/T233196) [12:02:39] SRE, Infrastructure-Foundations, netops, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero) [12:06:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:11:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P39448 and previous config saved to /var/cache/conftool/dbconfig/20221114-121149-ladsgroup.json [12:12:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T321130)', diff saved to https://phabricator.wikimedia.org/P39449 and
previous config saved to /var/cache/conftool/dbconfig/20221114-121202-marostegui.json [12:12:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1145.eqiad.wmnet with reason: Maintenance [12:12:07] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [12:12:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1145.eqiad.wmnet with reason: Maintenance [12:13:39] (CR) Vgutierrez: [C: +2] Add Simplified Chinese translations to browsersec.body.html.erb [puppet] - https://gerrit.wikimedia.org/r/816161 (owner: Diskdance) [12:14:47] (CR) Vgutierrez: [C: +2] "Thanks diskdance && wctaiwan" [puppet] - https://gerrit.wikimedia.org/r/816161 (owner: Diskdance) [12:15:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T321126)', diff saved to https://phabricator.wikimedia.org/P39450 and previous config saved to /var/cache/conftool/dbconfig/20221114-121525-marostegui.json [12:15:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1113.eqiad.wmnet with reason: Maintenance [12:15:32] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [12:15:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1113.eqiad.wmnet with reason: Maintenance [12:15:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T321126)', diff saved to https://phabricator.wikimedia.org/P39451 and previous config saved to /var/cache/conftool/dbconfig/20221114-121547-marostegui.json [12:15:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P39452 and previous config saved to /var/cache/conftool/dbconfig/20221114-121556-ladsgroup.json [12:16:58]
(KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:18:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T321126)', diff saved to https://phabricator.wikimedia.org/P39453 and previous config saved to /var/cache/conftool/dbconfig/20221114-121802-marostegui.json [12:18:26] (CR) Hnowlan: [C: +2] thumbor: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/856526 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan) [12:22:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:22:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1146.eqiad.wmnet with reason: Maintenance [12:22:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T321130)', diff saved to https://phabricator.wikimedia.org/P39454 and previous config saved to /var/cache/conftool/dbconfig/20221114-122214-marostegui.json [12:22:18] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [12:23:09] (Merged) jenkins-bot: thumbor: bump image version [deployment-charts] - https://gerrit.wikimedia.org/r/856526 (https://phabricator.wikimedia.org/T233196) (owner: Hnowlan) [12:24:50] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [12:25:16] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync [12:25:16] SRE, serviceops: Deploy etcddump (or another etcd dump & load tool) to production - https://phabricator.wikimedia.org/T135124 (jcrespo) I believe this was mislabeled, although please ask for help for dumping scheduling and monitoring, we have tooling we want to extend to services other than databases. [12:25:56] SRE, serviceops, Technical-Debt: Reduce etcd technical debt - https://phabricator.wikimedia.org/T135122 (jcrespo) [12:26:07] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync [12:26:24] PROBLEM - SSH on db1120.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:26:50] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [12:26:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T322618)', diff saved to https://phabricator.wikimedia.org/P39455 and previous config saved to /var/cache/conftool/dbconfig/20221114-122655-ladsgroup.json [12:26:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance [12:27:00] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [12:27:04] SRE, serviceops: Deploy etcddump (or another etcd dump & load tool) to production - https://phabricator.wikimedia.org/T135124 (jcrespo) Is this related to T281447? [12:27:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance [12:27:11] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:27:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P39456 and previous config saved to /var/cache/conftool/dbconfig/20221114-122717-ladsgroup.json [12:28:45] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:29:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P39457 and previous config saved to /var/cache/conftool/dbconfig/20221114-122938-ladsgroup.json [12:30:03] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [12:30:53] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[12:31:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P39458 and previous config saved to /var/cache/conftool/dbconfig/20221114-123103-ladsgroup.json [12:31:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [12:31:17] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [12:31:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [12:31:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:31:33] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [12:31:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [12:31:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T322618)', diff saved to https://phabricator.wikimedia.org/P39459 and previous config saved to /var/cache/conftool/dbconfig/20221114-123141-ladsgroup.json [12:33:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P39460 and previous config saved to /var/cache/conftool/dbconfig/20221114-123309-marostegui.json [12:34:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T321130)', diff saved to https://phabricator.wikimedia.org/P39461 and previous config saved to /var/cache/conftool/dbconfig/20221114-123427-marostegui.json [12:34:32] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [12:41:16] (PS3) Slyngshede: Initial checkin [software/bitu-ldap] -
https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) [12:44:43] (CR) Slyngshede: Initial checkin (20 comments) [software/bitu-ldap] - https://gerrit.wikimedia.org/r/853257 (https://phabricator.wikimedia.org/T313595) (owner: Slyngshede) [12:44:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P39462 and previous config saved to /var/cache/conftool/dbconfig/20221114-124444-ladsgroup.json [12:45:20] (CR) Jbond: [C: -1] netmon: Open LibreNMS port for netmon2002. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/854951 (https://phabricator.wikimedia.org/T315523) (owner: Andrea Denisse) [12:48:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P39463 and previous config saved to /var/cache/conftool/dbconfig/20221114-124815-marostegui.json [12:49:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P39464 and previous config saved to /var/cache/conftool/dbconfig/20221114-124934-marostegui.json [12:49:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:56:10] (PS1) Arturo Borrero Gonzalez: cloudgw: introduce more robust vlan interface naming [puppet] - https://gerrit.wikimedia.org/r/856544 [12:56:44] (CR) CI reject: [V: -1] cloudgw: introduce more robust vlan interface naming [puppet] - https://gerrit.wikimedia.org/r/856544 (owner: Arturo Borrero Gonzalez) [12:59:47] (CR) Filippo Giunchedi: [C: +2] dispatch: sync user role and info from LDAP (1 comment) [puppet] -
https://gerrit.wikimedia.org/r/852992 (https://phabricator.wikimedia.org/T313229) (owner: Filippo Giunchedi) [12:59:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P39465 and previous config saved to /var/cache/conftool/dbconfig/20221114-125951-ladsgroup.json [12:59:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:03:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T321126)', diff saved to https://phabricator.wikimedia.org/P39466 and previous config saved to /var/cache/conftool/dbconfig/20221114-130322-marostegui.json [13:03:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1131.eqiad.wmnet with reason: Maintenance [13:03:27] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [13:03:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1131.eqiad.wmnet with reason: Maintenance [13:03:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T321126)', diff saved to https://phabricator.wikimedia.org/P39467 and previous config saved to /var/cache/conftool/dbconfig/20221114-130343-marostegui.json [13:04:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P39468 and previous config saved to /var/cache/conftool/dbconfig/20221114-130440-marostegui.json [13:05:35] (CR) Arturo Borrero Gonzalez: [C: +1] Remove puppet leftovers [puppet] - https://gerrit.wikimedia.org/r/856525
(https://phabricator.wikimedia.org/T306840) (owner: Muehlenhoff) [13:05:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T321126)', diff saved to https://phabricator.wikimedia.org/P39469 and previous config saved to /var/cache/conftool/dbconfig/20221114-130555-marostegui.json [13:06:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:07:30] SRE, Traffic: Adapt all the things to localized Special: namespaces - https://phabricator.wikimedia.org/T105434 (jcrespo) I'm adding the traffic tag, but please feel free to move it to #traffic-icebox or maybe decline it all together? [13:08:14] (PS2) Arturo Borrero Gonzalez: cloudgw: introduce more robust vlan interface naming [puppet] - https://gerrit.wikimedia.org/r/856544 [13:09:46] PROBLEM - SSH on db1123.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:11:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:14:21] (PS3) Arturo Borrero Gonzalez: cloudgw: introduce more robust vlan interface naming [puppet] - https://gerrit.wikimedia.org/r/856544 [13:14:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P39470 and previous config saved to /var/cache/conftool/dbconfig/20221114-131457-ladsgroup.json [13:14:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime
for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance [13:15:02] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [13:15:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance [13:15:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T322618)', diff saved to https://phabricator.wikimedia.org/P39471 and previous config saved to /var/cache/conftool/dbconfig/20221114-131519-ladsgroup.json [13:15:29] 10SRE: URL shortener subdomains for useful Wikimedia infrastructure - https://phabricator.wikimedia.org/T223319 (10jcrespo) p:05Medium→03Low Priority-wise not giving a judgment call, but reflecting this has been untouched since 2019 and looks like a (nice) feature request, not a bug. [13:17:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T322618)', diff saved to https://phabricator.wikimedia.org/P39472 and previous config saved to /var/cache/conftool/dbconfig/20221114-131740-ladsgroup.json [13:19:02] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti1033.eqiad.wmnet to cluster eqiad and group D [13:19:31] (03PS4) 10Arturo Borrero Gonzalez: cloudgw: introduce more robust vlan interface naming [puppet] - 10https://gerrit.wikimedia.org/r/856544 [13:19:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T321130)', diff saved to https://phabricator.wikimedia.org/P39473 and previous config saved to /var/cache/conftool/dbconfig/20221114-131946-marostegui.json [13:19:48] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1147.eqiad.wmnet with reason: Maintenance [13:19:51] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [13:20:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 5:00:00 on db1147.eqiad.wmnet with reason: Maintenance [13:20:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T321130)', diff saved to https://phabricator.wikimedia.org/P39474 and previous config saved to /var/cache/conftool/dbconfig/20221114-132008-marostegui.json [13:20:19] 10SRE, 10Dumps-Generation, 10serviceops, 10MW-1.39-notes, and 2 others: conf* hosts ran out of disk space due to log spam - https://phabricator.wikimedia.org/T322360 (10jijiki) 05Open→03Resolved a:03jijiki This task itself looks like it is done, please reopen if you disagreen or if I am missing somet... [13:20:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:21:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P39475 and previous config saved to /var/cache/conftool/dbconfig/20221114-132101-marostegui.json [13:21:49] (03PS5) 10Arturo Borrero Gonzalez: cloudgw: introduce more robust vlan interface naming [puppet] - 10https://gerrit.wikimedia.org/r/856544 [13:23:38] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:38] PROBLEM - SSH on mw1328.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:25:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown 
[13:26:23] (PS6) Arturo Borrero Gonzalez: cloudgw: introduce more robust vlan interface naming [puppet] - https://gerrit.wikimedia.org/r/856544 [13:26:36] (CR) Arturo Borrero Gonzalez: [V: +1] "PCC as expected: https://puppet-compiler.wmflabs.org/pcc-worker1003/38132/" [puppet] - https://gerrit.wikimedia.org/r/856544 (owner: Arturo Borrero Gonzalez) [13:27:20] RECOVERY - SSH on db1120.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:29:35] (PS1) Bartosz Dziewoński: Use legacy DiscussionTools heading markup except on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/856550 (https://phabricator.wikimedia.org/T314714) [13:29:38] (PS1) Bartosz Dziewoński: Use new DiscussionTools heading markup on enwiki and plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/856551 (https://phabricator.wikimedia.org/T314714) [13:29:39] (PS1) Bartosz Dziewoński: Use new DiscussionTools heading markup everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/856552 (https://phabricator.wikimedia.org/T314714) [13:31:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T321130)', diff saved to https://phabricator.wikimedia.org/P39476 and previous config saved to /var/cache/conftool/dbconfig/20221114-133150-marostegui.json [13:31:55] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [13:31:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T322618)', diff saved to https://phabricator.wikimedia.org/P39477 and previous config saved to /var/cache/conftool/dbconfig/20221114-133157-ladsgroup.json [13:32:04] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [13:32:44] (PS1) Bartosz Dziewoński: ThreadItemStore: Handle race conditions when finding/inserting
outside of transaction [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/855744 (https://phabricator.wikimedia.org/T322701) [13:32:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P39478 and previous config saved to /var/cache/conftool/dbconfig/20221114-133246-ladsgroup.json [13:33:06] (CR) Kosta Harlan: [C: +1] GrowthExperiments: Make Vue mentor dashboard default by removing GEMentorDashboardUseVue [mediawiki-config] - https://gerrit.wikimedia.org/r/856008 (owner: Sergio Gimeno) [13:34:10] (CR) Kosta Harlan: [C: +1] GrowthExperiments: Make Vue mentor dashboard default by removing GEMentorDashboardUseVue (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/856008 (owner: Sergio Gimeno) [13:35:44] (CR) Klausman: [C: +1] istio: change configs to adapt for 1.15.3 [deployment-charts] - https://gerrit.wikimedia.org/r/855967 (https://phabricator.wikimedia.org/T322193) (owner: Elukey) [13:36:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P39479 and previous config saved to /var/cache/conftool/dbconfig/20221114-133608-marostegui.json [13:36:30] (PS1) Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on group1 wikis (#2) [mediawiki-config] - https://gerrit.wikimedia.org/r/856566 (https://phabricator.wikimedia.org/T315353) [13:36:39] (PS2) Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on group1 wikis (#2) [mediawiki-config] - https://gerrit.wikimedia.org/r/856566 (https://phabricator.wikimedia.org/T315353) [13:46:31] (PS7) Arturo Borrero Gonzalez: cloudgw: introduce more robust vlan interface naming [puppet] - https://gerrit.wikimedia.org/r/856544 [13:46:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to
https://phabricator.wikimedia.org/P39480 and previous config saved to /var/cache/conftool/dbconfig/20221114-134657-marostegui.json [13:47:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P39481 and previous config saved to /var/cache/conftool/dbconfig/20221114-134704-ladsgroup.json [13:47:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P39482 and previous config saved to /var/cache/conftool/dbconfig/20221114-134752-ladsgroup.json [13:51:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T321126)', diff saved to https://phabricator.wikimedia.org/P39483 and previous config saved to /var/cache/conftool/dbconfig/20221114-135114-marostegui.json [13:51:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1140.eqiad.wmnet with reason: Maintenance [13:51:19] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [13:51:30] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1140.eqiad.wmnet with reason: Maintenance [13:51:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1165.eqiad.wmnet with reason: Maintenance [13:51:53] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1165.eqiad.wmnet with reason: Maintenance [13:51:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:51:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:51:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [13:52:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T321126)', diff saved to https://phabricator.wikimedia.org/P39484 and previous config saved to /var/cache/conftool/dbconfig/20221114-135204-marostegui.json [13:54:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T321126)', diff saved to https://phabricator.wikimedia.org/P39485 and previous config saved to /var/cache/conftool/dbconfig/20221114-135416-marostegui.json [13:55:03] (03PS1) 10Hashar: Merge tag 'v3.5.4' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/856555 (https://phabricator.wikimedia.org/T307334) [13:55:29] (03CR) 10Vgutierrez: "I built it successfully on build2001 using the CMD:" [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/855996 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [13:58:01] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/pcc-worker1001/38133/" [puppet] - 10https://gerrit.wikimedia.org/r/856544 (owner: 10Arturo Borrero Gonzalez) [13:59:15] (03PS8) 10Arturo Borrero Gonzalez: cloudgw: introduce more robust vlan interface naming [puppet] - 10https://gerrit.wikimedia.org/r/856544 [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221114T1400). [14:00:04] MichaelG_WMDE and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[14:00:17] hey [14:00:20] * MichaelG_WMDE 👋 [14:01:18] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/pcc-worker1001/38134/" [puppet] - 10https://gerrit.wikimedia.org/r/856544 (owner: 10Arturo Borrero Gonzalez) [14:01:31] I can deploy today, give me a minute [14:01:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:02:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P39486 and previous config saved to /var/cache/conftool/dbconfig/20221114-140203-marostegui.json [14:02:08] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/853275 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [14:02:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P39487 and previous config saved to /var/cache/conftool/dbconfig/20221114-140210-ladsgroup.json [14:02:35] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/853276 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [14:02:49] Thank you taavi! 
Mine isn't urgent, happy to have MatmaRex go first :) [14:02:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T322618)', diff saved to https://phabricator.wikimedia.org/P39488 and previous config saved to /var/cache/conftool/dbconfig/20221114-140259-ladsgroup.json [14:03:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance [14:03:04] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [14:03:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance [14:03:19] (03CR) 10Jbond: [C: 03+1] pki: generate CSR for root_ca [puppet] - 10https://gerrit.wikimedia.org/r/853277 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [14:03:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P39489 and previous config saved to /var/cache/conftool/dbconfig/20221114-140320-ladsgroup.json [14:03:34] (03PS2) 10Majavah: Separate identifiers from other statements for Lexemes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855609 (https://phabricator.wikimedia.org/T318310) (owner: 10Michael Große) [14:04:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855609 (https://phabricator.wikimedia.org/T318310) (owner: 10Michael Große) [14:04:09] MatmaRex: hey, does the order of your patches matter? 
[14:04:12] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/853278 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [14:04:49] (03Merged) 10jenkins-bot: Separate identifiers from other statements for Lexemes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/855609 (https://phabricator.wikimedia.org/T318310) (owner: 10Michael Große) [14:04:56] taavi: yeah, the second config depends on the backports [14:05:02] !log taavi@deploy1002 Started scap: Backport for [[gerrit:855609|Separate identifiers from other statements for Lexemes (T318310)]] [14:05:07] T318310: Wikidata Lexeme pages does not seperate between identifiers and other statements - https://phabricator.wikimedia.org/T318310 [14:05:16] (03CR) 10Jbond: "the -1 here is just to ensure we copy the keys first otherwise all good" [puppet] - 10https://gerrit.wikimedia.org/r/853281 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [14:05:23] !log taavi@deploy1002 taavi and migr: Backport for [[gerrit:855609|Separate identifiers from other statements for Lexemes (T318310)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:05:24] ok! 
then I'll +2 the backports now to speed everything up [14:05:34] the first config is good to go whenever [14:05:38] MichaelG_WMDE: your patch is available for testing [14:05:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P39490 and previous config saved to /var/cache/conftool/dbconfig/20221114-140541-ladsgroup.json [14:05:54] (03CR) 10Majavah: [C: 03+2] "backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855744 (https://phabricator.wikimedia.org/T322701) (owner: 10Bartosz Dziewoński) [14:06:00] (03PS3) 10Majavah: persistRevisionThreadItems: Print time taken [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855745 (owner: 10Bartosz Dziewoński) [14:06:27] (03CR) 10Majavah: [C: 03+2] "backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855745 (owner: 10Bartosz Dziewoński) [14:06:54] taavi: I tested it and it works as expected 👍 [14:06:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:07:19] thanks, syncing [14:07:51] FTR: tested both on test.wikidata.org and www.wikidata.org, works as expected on both. 
[14:09:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P39491 and previous config saved to /var/cache/conftool/dbconfig/20221114-140923-marostegui.json [14:11:30] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:855609|Separate identifiers from other statements for Lexemes (T318310)]] (duration: 06m 27s) [14:11:34] T318310: Wikidata Lexeme pages does not seperate between identifiers and other statements - https://phabricator.wikimedia.org/T318310 [14:11:41] MichaelG_WMDE: all done! [14:11:54] MatmaRex: starting with your first config change next [14:11:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856550 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński) [14:11:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:12:03] taavi: Thank you! 🙏 [14:12:06] (03PS2) 10Majavah: Use legacy DiscussionTools heading markup except on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856550 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński) [14:12:10] (03CR) 10TrainBranchBot: "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856550 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński) [14:12:49] taavi: thanks. 
to clarify, this is a no-op change, the code using this config setting is not deployed yet, so there's nothing i could test here [14:12:57] (03Merged) 10jenkins-bot: Use legacy DiscussionTools heading markup except on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856550 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński) [14:12:59] (03Merged) 10jenkins-bot: ThreadItemStore: Handle race conditions when finding/inserting outside of transaction [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855744 (https://phabricator.wikimedia.org/T322701) (owner: 10Bartosz Dziewoński) [14:13:02] (03Merged) 10jenkins-bot: persistRevisionThreadItems: Print time taken [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855745 (owner: 10Bartosz Dziewoński) [14:13:32] !log taavi@deploy1002 prep aborted: (duration: 00m 06s) [14:13:32] !log taavi@deploy1002 backport aborted: (duration: 01m 23s) [14:14:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856550 (https://phabricator.wikimedia.org/T314714) (owner: 10Bartosz Dziewoński) [14:14:13] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855744 (https://phabricator.wikimedia.org/T322701) (owner: 10Bartosz Dziewoński) [14:14:15] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.8) - 10https://gerrit.wikimedia.org/r/855745 (owner: 10Bartosz Dziewoński) [14:14:23] MatmaRex: ack [14:14:24] !log taavi@deploy1002 Started scap: Backport for [[gerrit:856550|Use legacy DiscussionTools heading markup except on beta cluster (T314714)]], [[gerrit:855744|ThreadItemStore: Handle race conditions when finding/inserting outside of transaction (T322701)]], 
[[gerrit:855745|persistRevisionThreadItems: Print time taken]] [14:14:29] T314714: Metadata and buttons should be inserted after a heading, not inside of it - https://phabricator.wikimedia.org/T314714 [14:14:30] T322701: Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry '…' for key 'itid_itemid' / 'it_itemname' - https://phabricator.wikimedia.org/T322701 [14:14:44] !log taavi@deploy1002 taavi and matmarex: Backport for [[gerrit:856550|Use legacy DiscussionTools heading markup except on beta cluster (T314714)]], [[gerrit:855744|ThreadItemStore: Handle race conditions when finding/inserting outside of transaction (T322701)]], [[gerrit:855745|persistRevisionThreadItems: Print time taken]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [14:14:58] the backports got merged so including that in the same sync. are the backports testable without the last config change? [14:15:13] (Abandoned) Urbanecm: MentorFilterHooks: Only consider active mentees [extensions/GrowthExperiments] (wmf/1.40.0-wmf.8) - https://gerrit.wikimedia.org/r/853444 (https://phabricator.wikimedia.org/T318457) (owner: Urbanecm) [14:15:58] taavi: not really.
technically the code is enabled on other wikis, but they weren't suffering from the problems this fixes [14:16:24] ok, I'll just sync then [14:17:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T321130)', diff saved to https://phabricator.wikimedia.org/P39492 and previous config saved to /var/cache/conftool/dbconfig/20221114-141710-marostegui.json [14:17:12] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1148.eqiad.wmnet with reason: Maintenance [14:17:15] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [14:17:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T322618)', diff saved to https://phabricator.wikimedia.org/P39493 and previous config saved to /var/cache/conftool/dbconfig/20221114-141717-ladsgroup.json [14:17:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [14:17:22] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [14:17:24] (03CR) 10Hashar: [C: 03+2] Merge tag 'v3.5.4' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/856555 (https://phabricator.wikimedia.org/T307334) (owner: 10Hashar) [14:17:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1148.eqiad.wmnet with reason: Maintenance [14:17:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T321130)', diff saved to https://phabricator.wikimedia.org/P39494 and previous config saved to /var/cache/conftool/dbconfig/20221114-141731-marostegui.json [14:17:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [14:17:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1162 (T322618)', diff saved 
to https://phabricator.wikimedia.org/P39495 and previous config saved to /var/cache/conftool/dbconfig/20221114-141738-ladsgroup.json [14:19:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T322618)', diff saved to https://phabricator.wikimedia.org/P39496 and previous config saved to /var/cache/conftool/dbconfig/20221114-141950-ladsgroup.json [14:20:39] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:856550|Use legacy DiscussionTools heading markup except on beta cluster (T314714)]], [[gerrit:855744|ThreadItemStore: Handle race conditions when finding/inserting outside of transaction (T322701)]], [[gerrit:855745|persistRevisionThreadItems: Print time taken]] (duration: 06m 14s) [14:20:44] T314714: Metadata and buttons should be inserted after a heading, not inside of it - https://phabricator.wikimedia.org/T314714 [14:20:44] T322701: Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry '…' for key 'itid_itemid' / 'it_itemname' - https://phabricator.wikimedia.org/T322701 [14:20:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P39497 and previous config saved to /var/cache/conftool/dbconfig/20221114-142048-ladsgroup.json [14:20:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:21:18] (03PS3) 10Majavah: Enable wgDiscussionToolsEnablePermalinksBackend on group1 wikis (#2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856566 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [14:21:21] (03CR) 10CI reject: [V: 04-1] Merge tag 'v3.5.4' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/856555 
(https://phabricator.wikimedia.org/T307334) (owner: 10Hashar) [14:21:31] 10SRE, 10Traffic, 10Patch-For-Review: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 (10Vgutierrez) [14:21:38] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856566 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [14:22:17] (03CR) 10Hashar: [C: 03+2] "CI failed due to:" [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/856555 (https://phabricator.wikimedia.org/T307334) (owner: 10Hashar) [14:22:31] 10SRE, 10Traffic, 10Patch-For-Review: oom killed varnish on cp4047 - https://phabricator.wikimedia.org/T322903 (10Vgutierrez) I've updated the description of the task to add the used memory by varnish after running the experiment for the whole weekend. Data gathered with `systemctl show varnish-frontend -p M... [14:22:50] (03Merged) 10jenkins-bot: Enable wgDiscussionToolsEnablePermalinksBackend on group1 wikis (#2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856566 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [14:23:05] !log taavi@deploy1002 Started scap: Backport for [[gerrit:856566|Enable wgDiscussionToolsEnablePermalinksBackend on group1 wikis (#2) (T315353)]] [14:23:10] T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353 [14:23:26] !log taavi@deploy1002 taavi and matmarex: Backport for [[gerrit:856566|Enable wgDiscussionToolsEnablePermalinksBackend on group1 wikis (#2) (T315353)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet [14:23:40] MatmaRex: the config change is available for testing [14:24:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to 
https://phabricator.wikimedia.org/P39498 and previous config saved to /var/cache/conftool/dbconfig/20221114-142429-marostegui.json [14:24:47] looking [14:25:51] taavi: seems good [14:25:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:26:07] ok, syncing [14:26:14] (03CR) 10CI reject: [V: 04-1] Merge tag 'v3.5.4' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/856555 (https://phabricator.wikimedia.org/T307334) (owner: 10Hashar) [14:27:36] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [14:27:36] i'll be looking at the "mediawiki-new-errors" dashboard to make sure the errors (which should be fixed by the backports) aren't reoccurring [14:29:29] taavi: will you start the maintenance script as well? [14:29:36] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [14:29:46] it will probably take a few days to finish, on large wikis like commonswiki. can you save the output to a file, and upload it to phab later? [14:30:06] sure! 
[14:30:11] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:856566|Enable wgDiscussionToolsEnablePermalinksBackend on group1 wikis (#2) (T315353)]] (duration: 07m 05s) [14:30:15] T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353 [14:30:22] (03CR) 10Btullis: "There appears to be an issue with CI, whereby the namespace associated with the deployment to the dse-k8s cluster is being reported as 'de" [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [14:30:51] MatmaRex: is there a task it's related to? [14:31:16] oh, yes, sorry [14:31:49] taavi: https://phabricator.wikimedia.org/T315510 [14:32:40] !log START taavi@deploy1002:~$ foreachwikiindblist group1 extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --current --all | tee T315510.log # T315510 [14:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:44] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [14:33:35] !log ^ correction, starting it on mwmaint1002, not deploy1002 [14:33:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:54] thanks [14:34:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P39499 and previous config saved to /var/cache/conftool/dbconfig/20221114-143456-ladsgroup.json [14:34:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:35:07] MatmaRex: please remind me to upload the logs somewhere, I'll probably forget otherwise [14:35:34] 10SRE, 
10MediaWiki-Authentication-and-authorization, 10Platform Engineering, 10serviceops: Increased session loss since 20221001 - https://phabricator.wikimedia.org/T319279 (10jijiki) 05Open→03Resolved a:03jijiki Per @hnowlan's latest comment, I am marking this as resolved [14:35:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P39500 and previous config saved to /var/cache/conftool/dbconfig/20221114-143554-ladsgroup.json [14:37:30] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [14:38:35] taavi: hm, there has been a couple of errors logged, but i don't think they're a cause for alarm [14:38:55] (03CR) 10Hashar: [C: 03+2] Merge tag 'v3.5.4' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/856555 (https://phabricator.wikimedia.org/T307334) (owner: 10Hashar) [14:38:57] i'm looking at https://logstash.wikimedia.org/goto/28d1533cf8c78aed8025f63493274926 [14:39:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T321126)', diff saved to https://phabricator.wikimedia.org/P39501 and previous config saved to /var/cache/conftool/dbconfig/20221114-143936-marostegui.json [14:39:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1168.eqiad.wmnet with reason: Maintenance [14:39:41] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [14:39:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) 
for 5:00:00 on db1168.eqiad.wmnet with reason: Maintenance [14:39:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T321126)', diff saved to https://phabricator.wikimedia.org/P39502 and previous config saved to /var/cache/conftool/dbconfig/20221114-143957-marostegui.json [14:39:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:40:28] MatmaRex: hmm, those indeed look worrying. lmk if you want something reverted [14:41:11] (03CR) 10Herron: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/856523 (https://phabricator.wikimedia.org/T298994) (owner: 10Muehlenhoff) [14:41:20] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [14:42:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T321126)', diff saved to https://phabricator.wikimedia.org/P39503 and previous config saved to /var/cache/conftool/dbconfig/20221114-144209-marostegui.json [14:42:22] taavi: i don't think we need to revert, i'll file tasks about them. 
they shouldn't be causing any visible problems (they are explicitly caught and logged in our code, this isn't breaking the jobs or anything) [14:42:53] (03CR) 10CI reject: [V: 04-1] Merge tag 'v3.5.4' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/856555 (https://phabricator.wikimedia.org/T307334) (owner: 10Hashar) [14:43:00] (03CR) 10JMeybohm: [C: 04-1] Add a spark-operator chart and helmfile configuraiton (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [14:43:02] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 11): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38135/console" [puppet] - 10https://gerrit.wikimedia.org/r/855997 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [14:43:04] (03CR) 10Herron: [C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/856521 (https://phabricator.wikimedia.org/T281266) (owner: 10Muehlenhoff) [14:43:45] (03PS2) 10Herron: Remove leftover entry for centrallog2001 [puppet] - 10https://gerrit.wikimedia.org/r/856523 (https://phabricator.wikimedia.org/T298994) (owner: 10Muehlenhoff) [14:43:48] taavi: also the number of occurrences is very low this time [14:43:57] (03PS1) 10Jbond: testing: add files useful for testing locally [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/856562 [14:44:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T321130)', diff saved to https://phabricator.wikimedia.org/P39504 and previous config saved to /var/cache/conftool/dbconfig/20221114-144415-marostegui.json [14:44:20] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [14:45:20] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [14:46:27] (03PS1) 10Jbond: testing: add files useful for testing locally [software/cas-overlay-template] (testing) - 10https://gerrit.wikimedia.org/r/856563 [14:46:29] these errors look like we have "duplicate" jobs running, processing the same revisions twice at the same time? [14:49:04] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti1021.eqiad.wmnet with reason: Remove from cluster for eventual reimage [14:49:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti1021.eqiad.wmnet with reason: Remove from cluster for eventual reimage [14:50:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P39505 and previous config saved to /var/cache/conftool/dbconfig/20221114-145003-ladsgroup.json [14:50:44] PROBLEM - MariaDB Replica IO: s8 on db2166 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:50:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:51:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P39506 and previous config saved to /var/cache/conftool/dbconfig/20221114-145101-ladsgroup.json [14:51:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on 
db2175.codfw.wmnet with reason: Maintenance [14:51:06] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [14:51:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1021.eqiad.wmnet with OS bullseye [14:51:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2175.codfw.wmnet with reason: Maintenance [14:51:20] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1021.eqiad.wmnet with OS bullseye [14:51:20] PROBLEM - MariaDB Replica SQL: s8 on db2166 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [14:51:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T322618)', diff saved to https://phabricator.wikimedia.org/P39507 and previous config saved to /var/cache/conftool/dbconfig/20221114-145122-ladsgroup.json [14:51:48] PROBLEM - MariaDB read only s8 on db2166 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [14:53:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T322618)', diff saved to https://phabricator.wikimedia.org/P39508 and previous config saved to /var/cache/conftool/dbconfig/20221114-145343-ladsgroup.json [14:53:52] taavi: just to confirm, my changes are still deployed, and the script is running? [14:54:10] taavi: i am okay with the small number of errors, but let me know if you think this is a problem [14:54:21] i'm planning to file a bug and look into it later today [14:54:26] (03CR) 10Jbond: [C: 03+1] "lgtm, used `git diff upstream/6.6 -- . 
:^src :^etc :^debian`" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/856480 (owner: 10Muehlenhoff) [14:54:57] (03PS5) 10Btullis: Add a spark-operator chart and helmfile configuraiton [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) [14:55:00] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 126 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:55:13] (03CR) 10Btullis: Add a spark-operator chart and helmfile configuraiton (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [14:55:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:56:56] (03CR) 10Muehlenhoff: Sync to 6.6.2 of the CAS overlay (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/856480 (owner: 10Muehlenhoff) [14:56:58] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:57:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P39509 and previous config saved to /var/cache/conftool/dbconfig/20221114-145715-marostegui.json [14:57:52] (03CR) 10Jbond: "unfortunately brandon has reverted all the changes that related to the old data structure as such i think this will need to be reworked. " [puppet] - 10https://gerrit.wikimedia.org/r/841148 (owner: 10Hoo man) [14:59:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P39510 and previous config saved to /var/cache/conftool/dbconfig/20221114-145921-marostegui.json [14:59:27] MatmaRex: correct, I haven't stopped or reverted anything [15:00:39] okay. thanks [15:01:51] (03CR) 10Btullis: "As per this comment: https://phabricator.wikimedia.org/T318926#8389971" [deployment-charts] - 10https://gerrit.wikimedia.org/r/855674 (https://phabricator.wikimedia.org/T318926) (owner: 10Btullis) [15:02:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:03:43] (03PS1) 10Vgutierrez: prometheus::ats_node_config: remove prometheus-ats-config [puppet] - 10https://gerrit.wikimedia.org/r/856586 (https://phabricator.wikimedia.org/T292815) [15:04:02] (03CR) 10CI reject: [V: 04-1] Merge tag 'v3.5.4' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/856555 (https://phabricator.wikimedia.org/T307334) (owner: 10Hashar) [15:04:45] PROBLEM - MariaDB Replica Lag: s8 on db2166 is 
CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:04:47] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38136/console" [puppet] - 10https://gerrit.wikimedia.org/r/856586 (https://phabricator.wikimedia.org/T292815) (owner: 10Vgutierrez) [15:04:58] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01088 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:05:05] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] prometheus::ats_node_config: remove prometheus-ats-config [puppet] - 10https://gerrit.wikimedia.org/r/856586 (https://phabricator.wikimedia.org/T292815) (owner: 10Vgutierrez) [15:05:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T322618)', diff saved to https://phabricator.wikimedia.org/P39511 and previous config saved to /var/cache/conftool/dbconfig/20221114-150509-ladsgroup.json [15:05:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:05:14] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [15:05:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:05:31] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1021.eqiad.wmnet with reason: host reimage [15:05:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P39512 and previous config saved to /var/cache/conftool/dbconfig/20221114-150531-ladsgroup.json [15:06:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T322618)', diff saved to 
https://phabricator.wikimedia.org/P39513 and previous config saved to /var/cache/conftool/dbconfig/20221114-150642-ladsgroup.json [15:07:25] puppet agent failures is probably me, already fixed :) [15:07:31] PROBLEM - MariaDB Replica IO: s8 on db2166 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:08:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P39514 and previous config saved to /var/cache/conftool/dbconfig/20221114-150850-ladsgroup.json [15:09:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti1021.eqiad.wmnet with reason: host reimage [15:09:02] (03CR) 10CI reject: [V: 04-1] Merge tag 'v3.5.4' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/856555 (https://phabricator.wikimedia.org/T307334) (owner: 10Hashar) [15:09:33] (03PS3) 10Ssingh: Release 0.15.0-2 [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/855996 (https://phabricator.wikimedia.org/T321309) [15:10:59] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:11:13] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[15:11:24] (03CR) 10Ssingh: Release 0.15.0-2 (031 comment) [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/855996 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:12:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P39515 and previous config saved to /var/cache/conftool/dbconfig/20221114-151222-marostegui.json [15:13:01] !log initiating Cassandra bootstrap, aqs1019-a -- T307802 [15:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:05] T307802: Bootstrap new Cassandra nodes (eqiad) - https://phabricator.wikimedia.org/T307802 [15:14:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P39516 and previous config saved to /var/cache/conftool/dbconfig/20221114-151428-marostegui.json [15:15:33] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:15:42] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[15:16:21] (03CR) 10Vgutierrez: "I've missed another warning:" [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/855996 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:17:00] (03CR) 10Ssingh: Release 0.15.0-2 (031 comment) [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/855996 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:17:49] RECOVERY - cassandra-a SSL 10.64.48.119:7001 on aqs1019 is OK: SSL OK - Certificate aqs1019-a valid until 2024-11-08 15:06:30 +0000 (expires in 724 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [15:17:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:19:43] (03CR) 10Vgutierrez: Release 0.15.0-2 (031 comment) [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/855996 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:20:33] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thank you for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/853275 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [15:21:22] (03CR) 10Filippo Giunchedi: [C: 03+2] cfssl: introduce cfssl::csr [puppet] - 10https://gerrit.wikimedia.org/r/853276 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [15:21:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P39518 and previous config saved to /var/cache/conftool/dbconfig/20221114-152148-ladsgroup.json [15:22:52] (03CR) 10Filippo Giunchedi: [C: 03+2] pki: generate CSR for root_ca [puppet] - 10https://gerrit.wikimedia.org/r/853277 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [15:23:05] (03CR) 10Vgutierrez: [C: 03+1] "looking good, lintian warnings are gone, pristine-tar was a L8 issue on my side" [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/855996 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:23:55] (03CR) 10Filippo Giunchedi: [C: 03+2] pki: run initca for root_ca when needed [puppet] - 10https://gerrit.wikimedia.org/r/853278 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [15:23:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P39519 and previous config saved to /var/cache/conftool/dbconfig/20221114-152356-ladsgroup.json [15:24:05] (03PS7) 10Filippo Giunchedi: pki: run initca for root_ca when needed [puppet] - 10https://gerrit.wikimedia.org/r/853278 (https://phabricator.wikimedia.org/T319163) [15:24:59] (03PS1) 10JMeybohm: k8s: Add a central ipv6dualstack flag to enable dual stack [puppet] - 10https://gerrit.wikimedia.org/r/856589 (https://phabricator.wikimedia.org/T307943) [15:26:05] (03CR) 10Ssingh: Release 0.15.0-2 (031 comment) [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/855996 
(https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:26:48] (03Abandoned) 10Vgutierrez: varnish: Increase reserved memory to 120G in upload@ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/855965 (https://phabricator.wikimedia.org/T322903) (owner: 10Vgutierrez) [15:27:13] RECOVERY - SSH on mw1328.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:27:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T321126)', diff saved to https://phabricator.wikimedia.org/P39520 and previous config saved to /var/cache/conftool/dbconfig/20221114-152728-marostegui.json [15:27:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1180.eqiad.wmnet with reason: Maintenance [15:27:34] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [15:27:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1180.eqiad.wmnet with reason: Maintenance [15:27:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T321126)', diff saved to https://phabricator.wikimedia.org/P39521 and previous config saved to /var/cache/conftool/dbconfig/20221114-152749-marostegui.json [15:28:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti1021.eqiad.wmnet with OS bullseye [15:28:22] 10SRE, 10Ganeti, 10Infrastructure-Foundations: Upgrade ganeti/eqiad to Bullseye - https://phabricator.wikimedia.org/T311687 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti1021.eqiad.wmnet with OS bullseye completed: - ganeti1021 (**PASS**) - Downtimed on... 
[15:28:53] (03CR) 10Elukey: [C: 03+1] k8s: make profile::kubernetes::cluster_cidr mandatory [puppet] - 10https://gerrit.wikimedia.org/r/855997 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm) [15:29:19] (03CR) 10Vgutierrez: prometheus: Handle inactive trafficserver service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/851669 (https://phabricator.wikimedia.org/T292815) (owner: 10BCornwall) [15:29:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T321130)', diff saved to https://phabricator.wikimedia.org/P39522 and previous config saved to /var/cache/conftool/dbconfig/20221114-152936-marostegui.json [15:29:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1149.eqiad.wmnet with reason: Maintenance [15:29:41] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [15:30:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T321126)', diff saved to https://phabricator.wikimedia.org/P39523 and previous config saved to /var/cache/conftool/dbconfig/20221114-153001-marostegui.json [15:30:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1149.eqiad.wmnet with reason: Maintenance [15:30:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T321130)', diff saved to https://phabricator.wikimedia.org/P39524 and previous config saved to /var/cache/conftool/dbconfig/20221114-153030-marostegui.json [15:30:45] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.002966 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:31:36] (03CR) 10Filippo Giunchedi: [C: 03+2] cfssl: search intermediate key on filesystem [puppet] - 10https://gerrit.wikimedia.org/r/853279 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [15:32:37] 
(03CR) 10Filippo Giunchedi: cfssl: change intermediate key path (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/853281 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [15:32:58] (KubernetesRsyslogDown) firing: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:33:45] RECOVERY - cassandra-a service on aqs1019 is OK: OK - cassandra-a is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:34:00] (03PS1) 10Giuseppe Lavagetto: wikidata: fix puppet for finding load balancers after reverts [puppet] - 10https://gerrit.wikimedia.org/r/856590 [15:34:27] PROBLEM - dispatch.wikimedia.org requires authentication on alert1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 301 Moved Permanently https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:35:15] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38137/console" [puppet] - 10https://gerrit.wikimedia.org/r/856590 (owner: 10Giuseppe Lavagetto) [15:36:48] (03PS1) 10Vgutierrez: prometheus/trafficserver: Remove node_ats_config [puppet] - 10https://gerrit.wikimedia.org/r/856593 (https://phabricator.wikimedia.org/T292815) [15:36:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P39525 and previous config saved to /var/cache/conftool/dbconfig/20221114-153654-ladsgroup.json [15:37:15] (03CR) 10CI reject: [V: 04-1] prometheus/trafficserver: Remove node_ats_config [puppet] - 10https://gerrit.wikimedia.org/r/856593 (https://phabricator.wikimedia.org/T292815) (owner: 10Vgutierrez) [15:37:55] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 
03+2] wikidata: fix puppet for finding load balancers after reverts [puppet] - 10https://gerrit.wikimedia.org/r/856590 (owner: 10Giuseppe Lavagetto) [15:37:58] (KubernetesRsyslogDown) resolved: rsyslog on kubestage2002:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2002 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:38:21] (03CR) 10Ssingh: "Are we skipping cp4046 intentionally? That hiera override for that specifies:" [puppet] - 10https://gerrit.wikimedia.org/r/856000 (owner: 10Vgutierrez) [15:38:23] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbprov2004'] [15:39:03] (03CR) 10Jbond: [C: 03+1] "thanks lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/853281 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [15:39:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T322618)', diff saved to https://phabricator.wikimedia.org/P39526 and previous config saved to /var/cache/conftool/dbconfig/20221114-153903-ladsgroup.json [15:39:10] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [15:39:20] (03PS2) 10Vgutierrez: prometheus/trafficserver: Remove node_ats_config [puppet] - 10https://gerrit.wikimedia.org/r/856593 (https://phabricator.wikimedia.org/T292815) [15:39:50] (03CR) 10Vgutierrez: [V: 03+1] varnish: Remove deprecated jemalloc options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856000 (owner: 10Vgutierrez) [15:41:51] (03CR) 10Filippo Giunchedi: [C: 03+2] pki: generate CSR for root_ca (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/853277 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [15:41:53] (03PS1) 10Giuseppe Lavagetto: wikidata: fix for data structure manipulation [puppet] - 
10https://gerrit.wikimedia.org/r/856594 [15:42:07] (03PS2) 10Giuseppe Lavagetto: wikidata: fix for data structure manipulation [puppet] - 10https://gerrit.wikimedia.org/r/856594 [15:42:13] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] wikidata: fix for data structure manipulation [puppet] - 10https://gerrit.wikimedia.org/r/856594 (owner: 10Giuseppe Lavagetto) [15:42:44] (03CR) 10Ssingh: [C: 03+1] "Looks good, verifie PCC." [puppet] - 10https://gerrit.wikimedia.org/r/856000 (owner: 10Vgutierrez) [15:42:48] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['puppetdb2003'] [15:43:26] (03CR) 10Vgutierrez: [C: 03+1] Release 0.15.0-2 (031 comment) [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/855996 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:43:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T321130)', diff saved to https://phabricator.wikimedia.org/P39527 and previous config saved to /var/cache/conftool/dbconfig/20221114-154331-marostegui.json [15:43:36] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [15:43:55] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P39528 and previous config saved to /var/cache/conftool/dbconfig/20221114-154507-marostegui.json [15:45:21] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:03] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] varnish: Remove deprecated jemalloc options [puppet] - 10https://gerrit.wikimedia.org/r/856000 
(owner: 10Vgutierrez) [15:48:05] (03CR) 10Ssingh: "Merging without CI as per vgutierrez's +1." [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/855996 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:48:16] (03CR) 10Ssingh: [V: 03+2 C: 03+2] Release 0.15.0-2 [debs/varnish-modules] - 10https://gerrit.wikimedia.org/r/855996 (https://phabricator.wikimedia.org/T321309) (owner: 10Ssingh) [15:50:40] (03CR) 10Jbond: [C: 03+1] pki: generate CSR for root_ca (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/853277 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [15:50:54] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [15:51:06] (03PS1) 10Hashar: gerrit: script to report on git gc durations [puppet] - 10https://gerrit.wikimedia.org/r/856601 (https://phabricator.wikimedia.org/T237807) [15:51:47] (03CR) 10CI reject: [V: 04-1] gerrit: script to report on git gc durations [puppet] - 10https://gerrit.wikimedia.org/r/856601 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar) [15:52:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T322618)', diff saved to https://phabricator.wikimedia.org/P39529 and previous config saved to /var/cache/conftool/dbconfig/20221114-155201-ladsgroup.json [15:52:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:52:07] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [15:52:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [15:52:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T322618)', diff saved to https://phabricator.wikimedia.org/P39530 and previous config saved to 
/var/cache/conftool/dbconfig/20221114-155222-ladsgroup.json [15:52:53] PROBLEM - mysqld processes on db2166 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:53:47] (03CR) 10Hashar: [C: 03+2] "I have manually populated the cache in Castor for wmf/stable-3.5" [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/856555 (https://phabricator.wikimedia.org/T307334) (owner: 10Hashar) [15:54:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T322618)', diff saved to https://phabricator.wikimedia.org/P39531 and previous config saved to /var/cache/conftool/dbconfig/20221114-155435-ladsgroup.json [15:54:41] RECOVERY - mysqld processes on db2166 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:57:21] (03PS8) 10Filippo Giunchedi: cfssl: change intermediate key path [puppet] - 10https://gerrit.wikimedia.org/r/853281 (https://phabricator.wikimedia.org/T319163) [15:57:22] (03PS1) 10Filippo Giunchedi: pki: move root common settings to profile [puppet] - 10https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) [15:57:52] (03CR) 10CI reject: [V: 04-1] Merge tag 'v3.5.4' into wmf/stable-3.5 [software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/856555 (https://phabricator.wikimedia.org/T307334) (owner: 10Hashar) [15:58:29] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38140/console" [puppet] - 10https://gerrit.wikimedia.org/r/856603 (https://phabricator.wikimedia.org/T319163) (owner: 10Filippo Giunchedi) [15:58:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P39532 and previous config saved to /var/cache/conftool/dbconfig/20221114-155838-marostegui.json [15:59:05] 
RECOVERY - dispatch.wikimedia.org requires authentication on alert1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 588 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:00:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P39533 and previous config saved to /var/cache/conftool/dbconfig/20221114-160014-marostegui.json [16:01:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'MySQL issues', diff saved to https://phabricator.wikimedia.org/P39534 and previous config saved to /var/cache/conftool/dbconfig/20221114-160140-ladsgroup.json [16:03:16] !log reprepro -C main include bullseye-wikimedia varnish-modules_0.15.0-2_amd64.changes: T321309 [16:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:21] T321309: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 [16:04:57] (03CR) 10Filippo Giunchedi: "LGTM, please send a PCC run along too. Also is there a task / more context for this feature?" [puppet] - 10https://gerrit.wikimedia.org/r/850628 (owner: 10Majavah) [16:04:57] PROBLEM - dispatch.wikimedia.org requires authentication on alert1001 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... 
not found on https://dispatch.wikimedia.org:443/ - 552 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:05:05] (03CR) 10Filippo Giunchedi: [C: 03+1] Retire raid1-lvm-xfs-nova.cfg [puppet] - 10https://gerrit.wikimedia.org/r/855975 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [16:05:05] PROBLEM - MariaDB read only s8 on db2166 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:05:29] (03PS2) 10Majavah: thanos::rule: allow using without object store [puppet] - 10https://gerrit.wikimedia.org/r/850628 [16:05:31] (03PS2) 10Majavah: P:metricsinfra: add thanos rule [puppet] - 10https://gerrit.wikimedia.org/r/850629 (https://phabricator.wikimedia.org/T286301) [16:07:39] (03PS3) 10Majavah: thanos::rule: allow using without object store [puppet] - 10https://gerrit.wikimedia.org/r/850628 [16:07:41] (03PS3) 10Majavah: P:metricsinfra: add thanos rule [puppet] - 10https://gerrit.wikimedia.org/r/850629 (https://phabricator.wikimedia.org/T286301) [16:08:15] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['dbprov2004'] [16:08:23] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbprov2004'] [16:08:41] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38143/console" [puppet] - 10https://gerrit.wikimedia.org/r/850629 (https://phabricator.wikimedia.org/T286301) (owner: 10Majavah) [16:09:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P39535 and previous config saved to /var/cache/conftool/dbconfig/20221114-160941-ladsgroup.json [16:10:09] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38144/console" [puppet] - 10https://gerrit.wikimedia.org/r/850628 (owner: 10Majavah) [16:10:53] (03CR) 10Majavah: [V: 03+1] thanos::rule: allow using without object store (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/850628 (owner: 10Majavah) [16:11:43] RECOVERY - SSH on db1123.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:12:36] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['puppetdb2003'] [16:12:41] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['puppetdb2003'] [16:13:24] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['puppetdb2003'] [16:13:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P39536 and previous config saved to /var/cache/conftool/dbconfig/20221114-161344-marostegui.json [16:14:53] RECOVERY - dispatch.wikimedia.org requires authentication on alert1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 588 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:15:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T321126)', diff saved to https://phabricator.wikimedia.org/P39537 and previous config saved to /var/cache/conftool/dbconfig/20221114-161520-marostegui.json [16:15:22] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1187.eqiad.wmnet with reason: Maintenance [16:15:26] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [16:15:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware 
(exit_code=0) upgrade firmware for hosts ['dbprov2004'] [16:15:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1187.eqiad.wmnet with reason: Maintenance [16:15:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1187 (T321126)', diff saved to https://phabricator.wikimedia.org/P39539 and previous config saved to /var/cache/conftool/dbconfig/20221114-161553-marostegui.json [16:18:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T321126)', diff saved to https://phabricator.wikimedia.org/P39540 and previous config saved to /var/cache/conftool/dbconfig/20221114-161804-marostegui.json [16:18:55] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (10Ottomata) Approved. [16:19:29] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Dasm - https://phabricator.wikimedia.org/T322591 (10jcrespo) [16:20:53] PROBLEM - dispatch.wikimedia.org requires authentication on alert1001 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... 
not found on https://dispatch.wikimedia.org:443/ - 552 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:21:21] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/730962 (https://phabricator.wikimedia.org/T258779) (owner: 10Effie Mouzeli) [16:23:15] (03PS1) 10Ssingh: lvs4005: set profile::pybal::bgp to no [puppet] - 10https://gerrit.wikimedia.org/r/856604 [16:23:57] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38145/console" [puppet] - 10https://gerrit.wikimedia.org/r/856604 (owner: 10Ssingh) [16:24:40] (03PS5) 10Jbond: profile::mcrouter_wancache: Add remote DC gutter routes [puppet] - 10https://gerrit.wikimedia.org/r/730962 (https://phabricator.wikimedia.org/T258779) (owner: 10Effie Mouzeli) [16:24:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P39541 and previous config saved to /var/cache/conftool/dbconfig/20221114-162448-ladsgroup.json [16:27:13] (03CR) 10CI reject: [V: 04-1] profile::mcrouter_wancache: Add remote DC gutter routes [puppet] - 10https://gerrit.wikimedia.org/r/730962 (https://phabricator.wikimedia.org/T258779) (owner: 10Effie Mouzeli) [16:28:01] (03CR) 10Vgutierrez: [C: 03+1] "as soon as you restart pybal after running puppet this will effectively depool lvs4005" [puppet] - 10https://gerrit.wikimedia.org/r/856604 (owner: 10Ssingh) [16:28:49] !log cr3-ulsfo: set routing-options static route 198.35.26.96/28 next-hop 10.128.0.18 [lvs4005 decomm] [16:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T321130)', diff saved to https://phabricator.wikimedia.org/P39542 and previous config saved to /var/cache/conftool/dbconfig/20221114-162851-marostegui.json [16:28:54] !log marostegui@cumin1001 
START - Cookbook sre.hosts.downtime for 5:00:00 on db1150.eqiad.wmnet with reason: Maintenance [16:28:58] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [16:29:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1150.eqiad.wmnet with reason: Maintenance [16:30:04] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221114T1630). [16:30:18] !log cr4-ulsfo: set routing-options static route 198.35.26.96/28 next-hop 10.128.0.18 [lvs4005 decomm] [16:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:43] RECOVERY - dispatch.wikimedia.org requires authentication on alert1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 588 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:31:12] (03CR) 10Ssingh: [V: 03+1 C: 03+2] lvs4005: set profile::pybal::bgp to no [puppet] - 10https://gerrit.wikimedia.org/r/856604 (owner: 10Ssingh) [16:32:05] (03PS1) 10Marostegui: db2166: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/856606 (https://phabricator.wikimedia.org/T323040) [16:33:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P39543 and previous config saved to /var/cache/conftool/dbconfig/20221114-163312-marostegui.json [16:33:15] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:33:24] (03CR) 10Marostegui: [C: 03+2] db2166: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/856606 (https://phabricator.wikimedia.org/T323040) (owner: 10Marostegui) [16:33:51] ^^ sukhe I'm assuming that you've depooled lvs4005 already? 
:) [16:33:54] yep [16:34:07] you might want to log it ;P [16:34:07] https://grafana.wikimedia.org/d/000000343/load-balancers-lvs?orgId=1&var-site=ulsfo&viewPanel=4&from=now-5m&to=now [16:34:35] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users & Kerberos identity for Ilooremeta - https://phabricator.wikimedia.org/T322147 (10ILooremeta-WMF) Thanks, @fgiunchedi . The mail had been marked as spam, that's why I was missing it. [16:34:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2166.codfw.wmnet with reason: Host crashed T323040 [16:34:51] T323040: db2166 crashed several times - https://phabricator.wikimedia.org/T323040 [16:35:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2166.codfw.wmnet with reason: Host crashed T323040 [16:36:08] (03PS1) 10AikoChou: ml-services: update revert-risk's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/856609 (https://phabricator.wikimedia.org/T323023) [16:37:38] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:38:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1160.eqiad.wmnet with reason: Maintenance [16:39:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1160.eqiad.wmnet with reason: Maintenance [16:39:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1160 (T321130)', diff saved to https://phabricator.wikimedia.org/P39544 and previous config saved to /var/cache/conftool/dbconfig/20221114-163910-marostegui.json [16:39:14] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [16:39:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T322618)', diff saved to 
https://phabricator.wikimedia.org/P39545 and previous config saved to /var/cache/conftool/dbconfig/20221114-163954-ladsgroup.json [16:39:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance [16:39:59] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [16:40:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance [16:40:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T322618)', diff saved to https://phabricator.wikimedia.org/P39546 and previous config saved to /var/cache/conftool/dbconfig/20221114-164015-ladsgroup.json [16:40:18] PROBLEM - pybal on lvs4005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [16:40:23] PROBLEM - PyBal backends health check on lvs4005 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 (Connection refused) https://wikitech.wikimedia.org/wiki/PyBal [16:41:08] (03CR) 10Elukey: [C: 03+2] ml-services: update revert-risk's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/856609 (https://phabricator.wikimedia.org/T323023) (owner: 10AikoChou) [16:41:27] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on lvs4005.ulsfo.wmnet with reason: downtimed, in the process of decom [16:41:42] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs4005.ulsfo.wmnet with reason: downtimed, in the process of decom [16:41:42] !log depooled lvs4005 [16:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:45] RECOVERY - MariaDB read only s8 on db2166 is OK: Version 10.4.25-MariaDB-log, Uptime 25s, read_only: True, event_scheduler: True, 11.65 QPS, connection latency: 0.006150s, 
query latency: 0.002689s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [16:43:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T322618)', diff saved to https://phabricator.wikimedia.org/P39547 and previous config saved to /var/cache/conftool/dbconfig/20221114-164327-ladsgroup.json [16:43:54] (03PS1) 10Ssingh: lvs4008: set override_bgp_med to 0 [puppet] - 10https://gerrit.wikimedia.org/r/856610 [16:46:12] (03CR) 10Vgutierrez: [C: 04-1] "you shouldn't need to explicitly set the MED to 0, just drop the custom MED and set it as the primary LVS for high-traffic1" [puppet] - 10https://gerrit.wikimedia.org/r/856610 (owner: 10Ssingh) [16:47:24] (03CR) 10Ssingh: lvs4008: set override_bgp_med to 0 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856610 (owner: 10Ssingh) [16:48:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P39548 and previous config saved to /var/cache/conftool/dbconfig/20221114-164818-marostegui.json [16:50:22] (03CR) 10Vgutierrez: [C: 04-1] lvs4008: set override_bgp_med to 0 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856610 (owner: 10Ssingh) [16:51:30] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856611 (https://phabricator.wikimedia.org/T128546) [16:51:44] (03PS2) 10Ssingh: lvs4008: set as high-traffic1 primary LVS and remove lvs4005 [puppet] - 10https://gerrit.wikimedia.org/r/856610 [16:53:43] (03PS1) 10Herron: dispatch: add apache redirect from default org to wikimedia org [puppet] - 10https://gerrit.wikimedia.org/r/856612 (https://phabricator.wikimedia.org/T313229) [16:54:02] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856611 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:54:13] PROBLEM - dispatch.wikimedia.org 
requires authentication on alert1001 is CRITICAL: HTTP CRITICAL: Status line output matched HTTP/1.1 302 - header ocation: https://idp.wikim... not found on https://dispatch.wikimedia.org:443/ - 552 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:54:55] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/856611 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:55:13] 10SRE, 10Traffic: SSL CRITICAL - OCSP staple validity for www.wikipedia.bg has X seconds left - https://phabricator.wikimedia.org/T243948 (10BCornwall) [16:55:17] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: acme-chief should be able to refresh OCSP stapling response even if the renewal process fails - https://phabricator.wikimedia.org/T244232 (10BCornwall) 05In progress→03Resolved [16:56:59] 10SRE, 10Wikibase Product Platform, 10Wikimedia-Apache-configuration, 10serviceops: Incorrect handling of ETags taking precedence over timestamps in conditional requests - https://phabricator.wikimedia.org/T320241 (10jijiki) [16:58:11] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/pcc-worker1002/38147/" [puppet] - 10https://gerrit.wikimedia.org/r/856610 (owner: 10Ssingh) [16:58:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P39549 and previous config saved to /var/cache/conftool/dbconfig/20221114-165833-ladsgroup.json [16:58:45] 10SRE, 10Acme-chief, 10Cloud-VPS, 10Traffic-Icebox, and 2 others: acme-chief shouldn't try to perform OCSP stapling of expired certs - https://phabricator.wikimedia.org/T262251 (10BCornwall) 05In progress→03Resolved [16:59:40] (03PS1) 10Muehlenhoff: Sync to 6.6.2 of the CAS overlay [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/856616 [16:59:56] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: 
[[gerrit:856611| Bumping portals to master (T128546)]] (duration: 03m 58s) [17:00:00] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [17:00:01] (03PS2) 10Muehlenhoff: Sync to 6.6.2 of the CAS overlay [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/856616 [17:00:14] (03PS22) 10Filippo Giunchedi: centrallog: add first prototype of webrequest-live with Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [17:00:38] PROBLEM - Host db2173.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:00:54] (03CR) 10Muehlenhoff: Sync to 6.6.2 of the CAS overlay (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/856480 (owner: 10Muehlenhoff) [17:01:19] (03CR) 10Filippo Giunchedi: "LGTM overall, I've reworked a little the parsing to be more clear and added some basic tests, let me know what you think!" [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [17:01:32] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 189 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:03:08] (03PS23) 10Filippo Giunchedi: centrallog: add first prototype of webrequest-live with Benthos [puppet] - 10https://gerrit.wikimedia.org/r/854499 (https://phabricator.wikimedia.org/T314981) (owner: 10Elukey) [17:03:16] !log sukhe@cumin2002 START - Cookbook sre.hosts.decommission for hosts lvs4005.ulsfo.wmnet [17:03:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T321126)', diff saved to https://phabricator.wikimedia.org/P39550 and previous config saved to /var/cache/conftool/dbconfig/20221114-170325-marostegui.json [17:03:27] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1201.eqiad.wmnet with reason: Maintenance [17:03:30] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [17:03:36] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:03:41] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:856611| Bumping portals to master (T128546)]] (duration: 03m 45s) [17:03:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1201.eqiad.wmnet with reason: Maintenance [17:03:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1201 (T321126)', diff saved to https://phabricator.wikimedia.org/P39551 and previous config saved to /var/cache/conftool/dbconfig/20221114-170357-marostegui.json [17:04:57] (03PS10) 10Andrew Bogott: Add upgrade_openstack_node.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 [17:06:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T321126)', diff saved to https://phabricator.wikimedia.org/P39552 and previous config saved to /var/cache/conftool/dbconfig/20221114-170609-marostegui.json [17:07:26] !log sukhe@cumin2002 START - Cookbook sre.dns.netbox [17:07:40] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 12 hosts with reason: Reboot for kernel update [17:07:42] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on 12 hosts with reason: Reboot for kernel update [17:07:50] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wcqs[2001-2003].codfw.wmnet with reason: Reboot for 
kernel update [17:08:04] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wcqs[2001-2003].codfw.wmnet with reason: Reboot for kernel update [17:09:48] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:09:49] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts lvs4005.ulsfo.wmnet [17:09:56] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install ulsfo misc class hosts - https://phabricator.wikimedia.org/T317247 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by sukhe@cumin2002 for hosts: `lvs4005.ulsfo.wmnet` - lvs4005.ulsfo.wmnet (**WARN**) - D... [17:10:23] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/856616 (owner: 10Muehlenhoff) [17:10:37] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 2:00:00 on wcqs1002.eqiad.wmnet with reason: Reboot for kernel update [17:10:51] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 2:00:00 on wcqs1002.eqiad.wmnet with reason: Reboot for kernel update [17:11:18] RECOVERY - Host db2173.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.82 ms [17:13:20] !log dancy@deploy1002 Installing scap version "4.28.1" for 559 hosts [17:13:26] (03CR) 10Andrew Bogott: [C: 04-1] Add upgrade_openstack_node.py (034 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 (owner: 10Andrew Bogott) [17:13:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P39553 and previous config saved to /var/cache/conftool/dbconfig/20221114-171340-ladsgroup.json [17:13:50] !log dancy@deploy1002 Installation of scap version "4.28.1" completed for 559 hosts [17:14:53] 10SRE, 10conftool, 10serviceops: Not all confd errors throw icinga alerts - https://phabricator.wikimedia.org/T110933 
(10jijiki) 05Open→03Declined Bluntly closing this as there has been no update for quite some years now [17:21:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P39554 and previous config saved to /var/cache/conftool/dbconfig/20221114-172116-marostegui.json [17:21:40] RECOVERY - dispatch.wikimedia.org requires authentication on alert1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 588 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [17:24:06] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:24:33] (03CR) 10Vgutierrez: [C: 03+1] lvs4008: set as high-traffic1 primary LVS and remove lvs4005 [puppet] - 10https://gerrit.wikimedia.org/r/856610 (owner: 10Ssingh) [17:25:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:26:29] 10SRE, 10conftool, 10serviceops: confctl no longer logs a non-changing state change - https://phabricator.wikimedia.org/T161096 (10MoritzMuehlenhoff) 05Open→03Declined After five years we can now consider the established status quo, let's just keep it as-is. 
[17:28:01] (03Abandoned) 10Ssingh: Depool ulsfo for resolving varnish issues [dns] - 10https://gerrit.wikimedia.org/r/855987 (https://phabricator.wikimedia.org/T322903) (owner: 10Ssingh) [17:28:21] (03CR) 10Ssingh: [C: 03+2] lvs4008: set as high-traffic1 primary LVS and remove lvs4005 [puppet] - 10https://gerrit.wikimedia.org/r/856610 (owner: 10Ssingh) [17:28:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T322618)', diff saved to https://phabricator.wikimedia.org/P39555 and previous config saved to /var/cache/conftool/dbconfig/20221114-172846-ladsgroup.json [17:28:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance [17:28:51] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [17:29:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance [17:29:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T322618)', diff saved to https://phabricator.wikimedia.org/P39556 and previous config saved to /var/cache/conftool/dbconfig/20221114-172929-ladsgroup.json [17:31:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T322618)', diff saved to https://phabricator.wikimedia.org/P39557 and previous config saved to /var/cache/conftool/dbconfig/20221114-173140-ladsgroup.json [17:32:52] 10SRE, 10ops-codfw, 10decommission-hardware, 10serviceops-collab: decommission phab2001.codfw.wmnet - https://phabricator.wikimedia.org/T322880 (10Papaul) [17:33:25] 10SRE, 10ops-codfw, 10decommission-hardware, 10serviceops-collab: decommission phab2001.codfw.wmnet - https://phabricator.wikimedia.org/T322880 (10Papaul) 05Open→03Resolved complete [17:34:36] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['arclamp2001'] [17:36:23] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P39558 and previous config saved to /var/cache/conftool/dbconfig/20221114-173622-marostegui.json [17:39:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T321130)', diff saved to https://phabricator.wikimedia.org/P39559 and previous config saved to /var/cache/conftool/dbconfig/20221114-173925-marostegui.json [17:39:31] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [17:41:59] !log Restored CI caching mechanism which has been serving stalled caches since March 29th 2022 :-\ T307334 [17:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:04] T307334: Upgrade to Gerrit 3.5 - https://phabricator.wikimedia.org/T307334 [17:42:11] grrr wrong task [17:42:27] !log Restored CI caching mechanism which has been serving stalled caches since March 29th 2022 :-\ T323051 [17:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:31] T323051: Castor cache is ineffective due to a split brain - https://phabricator.wikimedia.org/T323051 [17:44:26] (03PS1) 10Hnowlan: swift: reenable more logging [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/856629 [17:46:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P39560 and previous config saved to /var/cache/conftool/dbconfig/20221114-174647-ladsgroup.json [17:51:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T321126)', diff saved to https://phabricator.wikimedia.org/P39561 and previous config saved to /var/cache/conftool/dbconfig/20221114-175129-marostegui.json [17:51:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [17:51:34] 
T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [17:51:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [17:51:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2117.codfw.wmnet with reason: Maintenance [17:52:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2117.codfw.wmnet with reason: Maintenance [17:52:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T321126)', diff saved to https://phabricator.wikimedia.org/P39562 and previous config saved to /var/cache/conftool/dbconfig/20221114-175213-marostegui.json [17:54:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P39563 and previous config saved to /var/cache/conftool/dbconfig/20221114-175432-marostegui.json [17:55:08] PROBLEM - SSH on mw1337.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:00:05] ryankemper: gettimeofday() says it's time for Wikidata Query Service weekly deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221114T1800) [18:00:36] (03CR) 10Andrew Bogott: [C: 03+2] P:metricsinfra::alertmanager: update default target [puppet] - 10https://gerrit.wikimedia.org/r/854494 (owner: 10Majavah) [18:00:43] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/854494 (owner: 10Majavah) [18:01:36] (03CR) 10Vlad.shapik: [C: 03+1] swift: reenable more logging [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/856629 (owner: 10Hnowlan) [18:01:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P39564 and previous config saved to /var/cache/conftool/dbconfig/20221114-180153-ladsgroup.json [18:07:31] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts ['arclamp2001'] [18:07:36] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['arclamp2001'] [18:07:56] (03CR) 10Dzahn: gerrit: script to report on git gc durations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856601 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar) [18:08:06] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['arclamp2001'] [18:09:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P39565 and previous config saved to /var/cache/conftool/dbconfig/20221114-180938-marostegui.json [18:09:57] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10Papaul) [18:13:27] (03CR) 10Hashar: [C: 03+2] "Also found out Castor cache was broken and fixed it with T323051 . There should be a fully populated cache now." 
[software/gerrit] (wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/856555 (https://phabricator.wikimedia.org/T307334) (owner: 10Hashar) [18:15:35] (03CR) 10Dzahn: "script works for me. the current result is: 0:31:26 TOTAL" [puppet] - 10https://gerrit.wikimedia.org/r/856601 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar) [18:17:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T322618)', diff saved to https://phabricator.wikimedia.org/P39566 and previous config saved to /var/cache/conftool/dbconfig/20221114-181700-ladsgroup.json [18:17:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:17:06] T322618: Fix renamed indexes of flaggedrevs_tracking table in production - https://phabricator.wikimedia.org/T322618 [18:17:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [18:17:17] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2052 - https://phabricator.wikimedia.org/T320482 (10Papaul) @RKemper @Gehel disk replaced [18:18:34] PROBLEM - Check systemd state on elastic2052 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service,elasticsearch_7@production-search-codfw.service,elasticsearch_7@production-search-omega-codfw.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:18:47] (03PS2) 10Hashar: gerrit: script to report on git gc durations [puppet] - 10https://gerrit.wikimedia.org/r/856601 (https://phabricator.wikimedia.org/T237807) [18:19:28] (03CR) 10Volans: [C: 04-1] gerrit: script to report on git gc durations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856601 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar) [18:19:33] (03Merged) 10jenkins-bot: Merge tag 'v3.5.4' into wmf/stable-3.5 [software/gerrit] 
(wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/856555 (https://phabricator.wikimedia.org/T307334) (owner: 10Hashar) [18:19:35] (03CR) 10Hashar: "Tyler wrote the script and I uploaded it for him, I think that would requires a +1 from him to validate he hereby place his script under A" [puppet] - 10https://gerrit.wikimedia.org/r/856601 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar) [18:20:30] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [18:22:24] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [18:23:46] AH the Gerrit change finally merged `\o/` [18:23:48] dinner time! 
[18:24:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T321130)', diff saved to https://phabricator.wikimedia.org/P39567 and previous config saved to /var/cache/conftool/dbconfig/20221114-182445-marostegui.json [18:24:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P39568 and previous config saved to /var/cache/conftool/dbconfig/20221114-182446-marostegui.json [18:24:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1190.eqiad.wmnet with reason: Maintenance [18:24:50] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [18:25:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1190.eqiad.wmnet with reason: Maintenance [18:25:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1190 (T321130)', diff saved to https://phabricator.wikimedia.org/P39569 and previous config saved to /var/cache/conftool/dbconfig/20221114-182506-marostegui.json [18:31:42] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [18:33:31] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10Dzahn) [18:33:34] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [18:34:07] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (10Dzahn) updated subteam contacts based on T316223#8381863 [18:34:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Dzahn) [18:35:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q2:rack/setup/install arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319433 (10Dzahn) updated sub-team contacts based on T316223#8381863 [18:37:21] (03PS2) 10Dzahn: phabricator: remove phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/824804 (https://phabricator.wikimedia.org/T280597) [18:37:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T321130)', diff saved to https://phabricator.wikimedia.org/P39570 and previous config saved to /var/cache/conftool/dbconfig/20221114-183738-marostegui.json [18:37:43] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [18:37:56] (03CR) 10CI reject: [V: 04-1] phabricator: remove phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/824804 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [18:39:47] (03PS3) 10Dzahn: phabricator: remove production role from phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/824804 (https://phabricator.wikimedia.org/T280597) [18:39:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117 (T321126)', diff saved to https://phabricator.wikimedia.org/P39571 and previous config 
saved to /var/cache/conftool/dbconfig/20221114-183952-marostegui.json [18:39:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2124.codfw.wmnet with reason: Maintenance [18:39:58] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [18:40:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2124.codfw.wmnet with reason: Maintenance [18:40:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T321126)', diff saved to https://phabricator.wikimedia.org/P39572 and previous config saved to /var/cache/conftool/dbconfig/20221114-184014-marostegui.json [18:42:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T321126)', diff saved to https://phabricator.wikimedia.org/P39573 and previous config saved to /var/cache/conftool/dbconfig/20221114-184235-marostegui.json [18:45:54] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [18:48:04] RECOVERY - Check systemd state on elastic2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:50:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [18:50:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [18:51:06] (03PS1) 10Jbond: utils: add additional selectors to pcc.py [puppet] - 10https://gerrit.wikimedia.org/r/856643 [18:51:52] (03CR) 10CI reject: [V: 04-1] utils: add additional selectors to pcc.py [puppet] - 10https://gerrit.wikimedia.org/r/856643 (owner: 10Jbond) [18:52:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P39574 and previous config saved to /var/cache/conftool/dbconfig/20221114-185244-marostegui.json [18:54:43] !log 
mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [18:55:56] RECOVERY - SSH on mw1337.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:56:53] (03CR) 10Andrew Bogott: [C: 04-1] Add upgrade_openstack_node.py (034 comments) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 (owner: 10Andrew Bogott) [18:57:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P39575 and previous config saved to /var/cache/conftool/dbconfig/20221114-185741-marostegui.json [18:58:22] (03PS4) 10Andrew Bogott: Rename live_upgrade_ussuri_to_victoria.py to remove version-specific name [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852312 [18:58:29] (03PS11) 10Andrew Bogott: Add upgrade_openstack_node.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 [19:01:25] (03PS1) 10Andrew Bogott: cloudbackup100[12]-dev to openstack version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/856645 (https://phabricator.wikimedia.org/T305828) [19:02:01] (03CR) 10CI reject: [V: 04-1] Add upgrade_openstack_node.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 (owner: 10Andrew Bogott) [19:06:13] (03CR) 10Andrew Bogott: [C: 03+2] Rename live_upgrade_ussuri_to_victoria.py to remove version-specific name [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852312 (owner: 10Andrew Bogott) [19:07:48] PROBLEM - SSH on mw1312.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:07:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P39576 and previous config saved to /var/cache/conftool/dbconfig/20221114-190750-marostegui.json [19:08:28] (03PS12) 10Andrew Bogott: Add 
upgrade_openstack_node.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 [19:08:46] (03CR) 10Andrew Bogott: [C: 03+2] cloudbackup100[12]-dev to openstack version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/856645 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [19:12:36] (03PS1) 10Andrew Bogott: Horizon: put into maintenance mode for Yoga upgrade [puppet] - 10https://gerrit.wikimedia.org/r/856647 (https://phabricator.wikimedia.org/T305828) [19:12:39] (03PS1) 10Andrew Bogott: eqiad1 OpenStack to version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/856648 (https://phabricator.wikimedia.org/T305828) [19:12:40] (03PS1) 10Andrew Bogott: Revert "Horizon: put into maintenance mode for Yoga upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/856649 [19:12:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P39577 and previous config saved to /var/cache/conftool/dbconfig/20221114-191247-marostegui.json [19:13:33] (03CR) 10Andrew Bogott: [C: 03+2] "Thank you Antoine!" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 (owner: 10Andrew Bogott) [19:16:58] (03Merged) 10jenkins-bot: Add upgrade_openstack_node.py [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/852313 (owner: 10Andrew Bogott) [19:19:19] (03CR) 10Nskaggs: "Confirmed working for me to add/remove myself from projects. 
I didn't yet exhaustively test all iterations of project admin or user" [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 (owner: 10David Caro) [19:20:51] (03PS2) 10Andrew Bogott: Revert "Horizon: put into maintenance mode for Yoga upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/856649 [19:20:53] (03PS1) 10Andrew Bogott: eqiad1 Horizon to Openstack version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/856650 [19:21:57] (03CR) 10Nskaggs: wmcs: add cookbook to add/remove a user to/from a project (031 comment) [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/851650 (owner: 10David Caro) [19:22:30] (03CR) 10CI reject: [V: 04-1] eqiad1 Horizon to Openstack version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/856650 (owner: 10Andrew Bogott) [19:22:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T321130)', diff saved to https://phabricator.wikimedia.org/P39578 and previous config saved to /var/cache/conftool/dbconfig/20221114-192257-marostegui.json [19:22:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db1199.eqiad.wmnet with reason: Maintenance [19:23:02] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [19:23:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db1199.eqiad.wmnet with reason: Maintenance [19:23:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1199 (T321130)', diff saved to https://phabricator.wikimedia.org/P39579 and previous config saved to /var/cache/conftool/dbconfig/20221114-192318-marostegui.json [19:25:21] (03PS1) 10Papaul: Add new node to site.pp and to netboot [puppet] - 10https://gerrit.wikimedia.org/r/856651 (https://phabricator.wikimedia.org/T319428) [19:26:09] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: put into maintenance mode for Yoga upgrade [puppet] - 10https://gerrit.wikimedia.org/r/856647 
(https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [19:26:39] (03CR) 10Papaul: [C: 03+2] Add new node to site.pp and to netboot [puppet] - 10https://gerrit.wikimedia.org/r/856651 (https://phabricator.wikimedia.org/T319428) (owner: 10Papaul) [19:27:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T321126)', diff saved to https://phabricator.wikimedia.org/P39580 and previous config saved to /var/cache/conftool/dbconfig/20221114-192754-marostegui.json [19:27:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2129.codfw.wmnet with reason: Maintenance [19:27:59] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [19:28:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2129.codfw.wmnet with reason: Maintenance [19:28:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2129 (T321126)', diff saved to https://phabricator.wikimedia.org/P39581 and previous config saved to /var/cache/conftool/dbconfig/20221114-192816-marostegui.json [19:29:08] (03CR) 10Andrew Bogott: [C: 03+2] eqiad1 OpenStack to version 'yoga' [puppet] - 10https://gerrit.wikimedia.org/r/856648 (https://phabricator.wikimedia.org/T305828) (owner: 10Andrew Bogott) [19:29:57] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host dbprov2004.codfw.wmnet with OS bullseye [19:30:06] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup, 10Patch-For-Review: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host dbprov2004.codfw.wmnet with OS bullseye [19:30:18] (ProbeDown) firing: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#labweb-ssl:7443 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:30:18] (ProbeDown) firing: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#labweb-ssl:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:30:34] labweb, you say [19:30:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T321126)', diff saved to https://phabricator.wikimedia.org/P39582 and previous config saved to /var/cache/conftool/dbconfig/20221114-193037-marostegui.json [19:30:50] Uh? [19:31:16] andrewbogott: any chance that labweb-ssl prober alert is related to ongoing work? [19:31:29] probably [19:32:08] did it page you? [19:32:20] Yes [19:32:25] it did :) need me to do anything, or should I leave the investigation for you? [19:33:31] It's expected, part of an upgrade. I don't think I'm familiar with that alert though (and it didn't show up on my board or page me). That's an lvs thing? [19:33:39] vgutierrez: Shouldn't you have not been paged? [19:34:21] Anyway, I'm making a log of things that need better silencing during this upgrade so if you tell me where/what to silence I'll add it to the list :) [19:35:36] brett: got paged via the irc message including hashtag p.a.g.e [19:35:45] And now via victorops [19:35:47] andrewbogott: these are https://wikitech.wikimedia.org/wiki/Network_monitoring#Blackbox_Probes_(Prometheus) [19:35:58] I guess nobody acked the incident [19:36:00] everybody got paged now [19:36:10] ok, thanks.
Surprised they aren't on my board, I guess they aren't tagged as wmcs [19:36:10] that's my fault for not acking, sorry [19:36:11] I received this but see backlog [19:36:13] shit, sorry, was in a meeting and didn't ack [19:36:25] for everyone else who just got paged, no action needed and I owe you a beer [19:36:31] was just keeping an eye on since it looked like andrewbogott was on it [19:36:35] go back to your evenings :) [19:37:26] andrewbogott: for your list note also the auto-generated runbook url in the alert: https://wikitech.wikimedia.org/wiki/Runbook#labweb-ssl:7443 -- if you could populate that section for the next time this fires, it'd be much appreciated [19:37:44] ok! [19:40:49] should we send the ACK via SMS? (so that people who got the alert but were not on computer know) [19:41:23] if they look at VO they will see it acked, I don't want to buzz everyone's phone *again* [19:41:42] (03PS1) 10CDanis: Fix duplicate definition of httpreqrate table [puppet] - 10https://gerrit.wikimedia.org/r/856652 (https://phabricator.wikimedia.org/T306580) [19:41:44] or, better yet, since andrewbogott is on top of this, I'll just resolve it in VO, which *will* buzz everyone's phone but the good kind [19:41:52] yea, that's like the other side of it and it's a good point, ack about the acks [19:43:18] I'm not sure I can even see that alert in VO but I will search a bit more [19:45:17] andrewbogott: https://portal.victorops.com/ui/wikimedia/incident/3150/details [19:45:28] (03PS1) 10Ebernhardson: snapshot: Apply minor cleanups to cirrus dump script [puppet] - 10https://gerrit.wikimedia.org/r/856653 [19:45:32] (03PS1) 10Ebernhardson: snapshot: Parallelize cirrus dumps by db shard [puppet] - 10https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) [19:45:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P39583 and previous config saved to 
/var/cache/conftool/dbconfig/20221114-194543-marostegui.json [19:46:35] (03CR) 10CI reject: [V: 04-1] snapshot: Parallelize cirrus dumps by db shard [puppet] - 10https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) (owner: 10Ebernhardson) [19:47:10] (03CR) 10CDanis: [C: 03+2] "pcc lgtm https://puppet-compiler.wmflabs.org/pcc-worker1002/38150/" [puppet] - 10https://gerrit.wikimedia.org/r/856652 (https://phabricator.wikimedia.org/T306580) (owner: 10CDanis) [19:50:39] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: Upgrade horizon to Z to prepare for Openstack upgrades past Wallaby -- T305828 [19:50:44] T305828: upgrade cloud-vps openstack to Openstack version 'Yoga' - https://phabricator.wikimedia.org/T305828 [19:53:30] (03PS2) 10Ebernhardson: snapshot: Parallelize cirrus dumps by db shard [puppet] - 10https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) [19:54:27] (03CR) 10CI reject: [V: 04-1] snapshot: Parallelize cirrus dumps by db shard [puppet] - 10https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) (owner: 10Ebernhardson) [19:55:21] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: Upgrade horizon to Z to prepare for Openstack upgrades past Wallaby -- T305828 (duration: 04m 41s) [19:58:19] (03PS3) 10Ebernhardson: snapshot: Parallelize cirrus dumps by db shard [puppet] - 10https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) [19:58:22] (03PS1) 10Ebernhardson: snapshot: Remove absented cirrus dump job [puppet] - 10https://gerrit.wikimedia.org/r/856655 [19:59:18] (03CR) 10CI reject: [V: 04-1] snapshot: Parallelize cirrus dumps by db shard [puppet] - 10https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) (owner: 10Ebernhardson) [20:00:31] (03PS4) 10Ebernhardson: snapshot: Parallelize cirrus dumps by db shard [puppet] - 10https://gerrit.wikimedia.org/r/856654 
(https://phabricator.wikimedia.org/T265056) [20:00:33] (03PS2) 10Ebernhardson: snapshot: Remove absented cirrus dump job [puppet] - 10https://gerrit.wikimedia.org/r/856655 [20:00:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129', diff saved to https://phabricator.wikimedia.org/P39585 and previous config saved to /var/cache/conftool/dbconfig/20221114-200050-marostegui.json [20:01:15] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:01:27] (03CR) 10CI reject: [V: 04-1] snapshot: Parallelize cirrus dumps by db shard [puppet] - 10https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) (owner: 10Ebernhardson) [20:01:32] (03CR) 10Ebernhardson: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38153/console" [puppet] - 10https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) (owner: 10Ebernhardson) [20:05:08] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:06:19] (03CR) 10Thcipriani: [C: 03+1] gerrit: script to report on git gc durations (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/856601 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar) [20:08:38] RECOVERY - SSH on mw1312.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:11:02] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1004 is CRITICAL: CRITICAL - Expecting active but unit nfs-exportd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:13:00] RECOVERY - Ensure NFS exports 
are maintained for new instances with NFS on labstore1004 is OK: OK - nfs-exportd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:14:56] PROBLEM - SSH on db1123.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:15:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2129 (T321126)', diff saved to https://phabricator.wikimedia.org/P39586 and previous config saved to /var/cache/conftool/dbconfig/20221114-201556-marostegui.json [20:15:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2141.codfw.wmnet with reason: Maintenance [20:16:02] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [20:16:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2141.codfw.wmnet with reason: Maintenance [20:16:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2158.codfw.wmnet with reason: Maintenance [20:16:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2158.codfw.wmnet with reason: Maintenance [20:16:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2095.codfw.wmnet with reason: Maintenance [20:16:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2095.codfw.wmnet with reason: Maintenance [20:16:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T321126)', diff saved to https://phabricator.wikimedia.org/P39587 and previous config saved to /var/cache/conftool/dbconfig/20221114-201650-marostegui.json [20:19:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T321126)', diff saved to https://phabricator.wikimedia.org/P39588 and previous config saved 
to /var/cache/conftool/dbconfig/20221114-201911-marostegui.json [20:20:42] (03CR) 10Dzahn: phabricator: add parameter for mysql port, set it to 3323 if using slave (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/856013 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [20:20:57] (03PS2) 10Dzahn: phabricator: add parameter for mysql port, set it to 3323 if using slave [puppet] - 10https://gerrit.wikimedia.org/r/856013 (https://phabricator.wikimedia.org/T280597) [20:23:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T321130)', diff saved to https://phabricator.wikimedia.org/P39589 and previous config saved to /var/cache/conftool/dbconfig/20221114-202334-marostegui.json [20:23:40] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [20:49:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P39592 and previous config saved to /var/cache/conftool/dbconfig/20221114-204924-marostegui.json [20:52:39] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Horizon: put into maintenance mode for Yoga upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/856649 (owner: 10Andrew Bogott) [20:52:59] (03CR) 10Andrew Bogott: [C: 03+2] live_upgrade_openstack: Remove confirmation prompt at the beginning [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856666 (owner: 10Andrew Bogott) [20:53:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P39593 and previous config saved to /var/cache/conftool/dbconfig/20221114-205347-marostegui.json [20:55:18] (ProbeDown) resolved: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#labweb-ssl:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:55:19] (ProbeDown) resolved: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#labweb-ssl:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:56:28] (03Merged) 10jenkins-bot: live_upgrade_openstack: Remove confirmation prompt at the beginning [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/856666 (owner: 10Andrew Bogott) [20:57:04] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "looks good here, no change on phab1001 but changes the port on the other 2 servers: https://puppet-compiler.wmflabs.org/pcc-worker1002/38" [puppet] - 10https://gerrit.wikimedia.org/r/856013 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [20:57:58] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:57:58] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:00:04] (03PS4) 10Hashar: Gerrit v3.5.4 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/824200 (https://phabricator.wikimedia.org/T307334) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221114T2100). [21:00:05] No Gerrit patches in the queue for this window AFAICS. 
[21:00:31] (03CR) 10CI reject: [V: 04-1] Gerrit v3.5.4 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/824200 (https://phabricator.wikimedia.org/T307334) (owner: 10Hashar) [21:00:37] (03PS1) 10Jbond: apereo_cas: update beaker tests [puppet] - 10https://gerrit.wikimedia.org/r/856668 [21:01:20] (03CR) 10CI reject: [V: 04-1] apereo_cas: update beaker tests [puppet] - 10https://gerrit.wikimedia.org/r/856668 (owner: 10Jbond) [21:01:43] (03PS2) 10Jbond: apereo_cas: update beaker tests [puppet] - 10https://gerrit.wikimedia.org/r/856668 [21:01:54] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [21:02:58] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:02:58] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:03:56] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/856670 [21:03:59] (03CR) 10Dzahn: [C: 03+2] phabricator: add parameter for mysql port, set it to 3323 if using slave [puppet] - 10https://gerrit.wikimedia.org/r/856013 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [21:04:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T321126)', diff saved to https://phabricator.wikimedia.org/P39594 and previous config saved to /var/cache/conftool/dbconfig/20221114-210430-marostegui.json [21:04:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2169.codfw.wmnet with reason: Maintenance [21:04:35] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 
[21:04:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2169.codfw.wmnet with reason: Maintenance [21:05:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T321126)', diff saved to https://phabricator.wikimedia.org/P39595 and previous config saved to /var/cache/conftool/dbconfig/20221114-210503-marostegui.json [21:05:09] (PS3) Jbond: apereo_cas: update beaker tests [puppet] - https://gerrit.wikimedia.org/r/856668 [21:06:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T321126)', diff saved to https://phabricator.wikimedia.org/P39596 and previous config saved to /var/cache/conftool/dbconfig/20221114-210623-marostegui.json [21:06:32] (CR) Dzahn: [C: +2] "well, it does change the config on prod server because now it sets the port specifically instead of falling back to defaults. even nicer w" [puppet] - https://gerrit.wikimedia.org/r/856013 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [21:07:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:07:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:08:43] (Abandoned) Andrew Bogott: eqiad1 Horizon to Openstack version 'zed' [puppet] - https://gerrit.wikimedia.org/r/856650 (owner: Andrew Bogott) [21:08:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T321130)', diff saved to https://phabricator.wikimedia.org/P39597 and previous config saved to /var/cache/conftool/dbconfig/20221114-210853-marostegui.json [21:08:55] !log marostegui@cumin1001 
START - Cookbook sre.hosts.downtime for 5:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [21:08:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [21:08:59] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [21:10:07] (CR) Andrew Bogott: [C: +2] add wmcs-securitygroup-backfill [puppet] - https://gerrit.wikimedia.org/r/850592 (https://phabricator.wikimedia.org/T288108) (owner: Andrew Bogott) [21:11:58] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [21:16:50] (CR) Hashar: "recheck after having uploaded artifacts to Archiva" [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/824200 (https://phabricator.wikimedia.org/T307334) (owner: Hashar) [21:18:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2099.codfw.wmnet with reason: Maintenance [21:19:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2099.codfw.wmnet with reason: Maintenance [21:21:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P39598 and previous config saved to /var/cache/conftool/dbconfig/20221114-212130-marostegui.json [21:21:58] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:21:58] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:25:08] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 
'sync'. [21:26:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:26:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:27:34] (CR) BCornwall: [C: +1] prometheus/trafficserver: Remove node_ats_config [puppet] - https://gerrit.wikimedia.org/r/856593 (https://phabricator.wikimedia.org/T292815) (owner: Vgutierrez) [21:27:45] (CR) Dzahn: [C: +2] "somewhat unexpectedly it's a complete noop on phab1001. there is this code structure here:" [puppet] - https://gerrit.wikimedia.org/r/856013 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [21:29:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2106.codfw.wmnet with reason: Maintenance [21:29:28] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2106.codfw.wmnet with reason: Maintenance [21:29:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2106 (T321130)', diff saved to https://phabricator.wikimedia.org/P39599 and previous config saved to /var/cache/conftool/dbconfig/20221114-212934-marostegui.json [21:29:39] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [21:33:16] (CR) Dzahn: [C: +2] "on the other 2 hosts this changed the value in the actual file /etc/phabtools.conf" [puppet] - https://gerrit.wikimedia.org/r/856013 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [21:34:02] (CR) Dzahn: [C: +2] "though.. it already knew about the slave and slave port .. 
in other ways:" [puppet] - https://gerrit.wikimedia.org/r/856013 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [21:35:13] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [21:35:21] (PS1) Arturo Borrero Gonzalez: ceph: osd: factorize config read [puppet] - https://gerrit.wikimedia.org/r/856674 [21:35:23] (PS1) Arturo Borrero Gonzalez: ceph: osd: introduce support for single NIC setup [puppet] - https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) [21:35:35] !log phab2002 - systemctl start phd, debug why it still fails [21:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:51] (PS5) Ebernhardson: snapshot: Parallelize cirrus dumps by db shard [puppet] - https://gerrit.wikimedia.org/r/856654 (https://phabricator.wikimedia.org/T265056) [21:35:53] (PS3) Ebernhardson: snapshot: Remove absented cirrus dump job [puppet] - https://gerrit.wikimedia.org/r/856655 [21:36:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P39600 and previous config saved to /var/cache/conftool/dbconfig/20221114-213636-marostegui.json [21:37:56] (PS2) Arturo Borrero Gonzalez: ceph: osd: introduce support for single NIC setup [puppet] - https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) [21:38:36] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. 
[21:38:37] (CR) CI reject: [V: -1] ceph: osd: introduce support for single NIC setup [puppet] - https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez) [21:40:50] (CR) Dzahn: [C: +2] "manually starting the "phd" service on phab2002 still fails and debugging shows it is still the DB connection" [puppet] - https://gerrit.wikimedia.org/r/856013 (https://phabricator.wikimedia.org/T280597) (owner: Dzahn) [21:41:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T321130)', diff saved to https://phabricator.wikimedia.org/P39601 and previous config saved to /var/cache/conftool/dbconfig/20221114-214125-marostegui.json [21:41:30] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [21:42:01] (CR) Andrew Bogott: [C: +1] ceph: osd: introduce support for single NIC setup [puppet] - https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez) [21:43:24] (PS3) Arturo Borrero Gonzalez: ceph: osd: introduce support for single NIC setup [puppet] - https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) [21:46:28] (CR) CI reject: [V: -1] ceph: osd: introduce support for single NIC setup [puppet] - https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez) [21:47:35] (PS4) Arturo Borrero Gonzalez: ceph: osd: introduce support for single NIC setup [puppet] - https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) [21:48:41] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[21:49:25] (CR) Arturo Borrero Gonzalez: [V: +1] "PCC as expected: https://puppet-compiler.wmflabs.org/pcc-worker1001/38156/" [puppet] - https://gerrit.wikimedia.org/r/856675 (https://phabricator.wikimedia.org/T319184) (owner: Arturo Borrero Gonzalez) [21:51:38] SRE, Infrastructure-Foundations, netops, Patch-For-Review, cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (aborrero) [21:51:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T321126)', diff saved to https://phabricator.wikimedia.org/P39602 and previous config saved to /var/cache/conftool/dbconfig/20221114-215143-marostegui.json [21:51:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2171.codfw.wmnet with reason: Maintenance [21:51:50] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [21:51:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2171.codfw.wmnet with reason: Maintenance [21:52:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2171:3316 (T321126)', diff saved to https://phabricator.wikimedia.org/P39603 and previous config saved to /var/cache/conftool/dbconfig/20221114-215204-marostegui.json [21:54:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T321126)', diff saved to https://phabricator.wikimedia.org/P39604 and previous config saved to /var/cache/conftool/dbconfig/20221114-215425-marostegui.json [21:56:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P39605 and previous config saved to /var/cache/conftool/dbconfig/20221114-215631-marostegui.json [21:56:34] (PS1) Andrew Bogott: Forward wmcs-securitygroup-backfill.py to version Yoga [puppet] - 
https://gerrit.wikimedia.org/r/856684 [21:57:29] (CR) Andrew Bogott: [C: +2] Forward wmcs-securitygroup-backfill.py to version Yoga [puppet] - https://gerrit.wikimedia.org/r/856684 (owner: Andrew Bogott) [21:58:56] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [21:58:58] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:58:58] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:00:05] Reedy, sbassett, Maryum, and manfredi: May I have your attention please! Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221114T2200) [22:03:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:03:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:03:59] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [22:09:01] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [22:09:07] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. 
[22:09:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P39606 and previous config saved to /var/cache/conftool/dbconfig/20221114-220932-marostegui.json [22:11:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106', diff saved to https://phabricator.wikimedia.org/P39607 and previous config saved to /var/cache/conftool/dbconfig/20221114-221138-marostegui.json [22:16:46] RECOVERY - SSH on db1123.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:19:12] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [22:21:18] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [22:24:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316', diff saved to https://phabricator.wikimedia.org/P39608 and previous config saved to /var/cache/conftool/dbconfig/20221114-222438-marostegui.json [22:26:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2106 (T321130)', diff saved to https://phabricator.wikimedia.org/P39609 and previous config saved to /var/cache/conftool/dbconfig/20221114-222644-marostegui.json [22:26:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2110.codfw.wmnet with reason: Maintenance [22:26:50] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [22:27:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2110.codfw.wmnet with reason: Maintenance [22:27:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2110 (T321130)', diff saved to https://phabricator.wikimedia.org/P39610 and previous config saved to /var/cache/conftool/dbconfig/20221114-222706-marostegui.json [22:28:53] 
(JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:30:02] looking into this [22:31:23] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [22:33:53] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:38:53] (JobUnavailable) resolved: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:39:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2171:3316 (T321126)', diff saved to https://phabricator.wikimedia.org/P39611 and previous config saved to /var/cache/conftool/dbconfig/20221114-223945-marostegui.json [22:39:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2180.codfw.wmnet with reason: Maintenance [22:39:50] T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [22:40:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2180.codfw.wmnet with reason: Maintenance [22:40:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T321126)', diff saved to https://phabricator.wikimedia.org/P39612 and previous config saved to /var/cache/conftool/dbconfig/20221114-224006-marostegui.json [22:40:58] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-ctrl1001:9105 is 
missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:40:58] (KubernetesRsyslogDown) firing: rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=aux-k8s-worker1001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:41:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T321130)', diff saved to https://phabricator.wikimedia.org/P39613 and previous config saved to /var/cache/conftool/dbconfig/20221114-224132-marostegui.json [22:41:37] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [22:42:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T321126)', diff saved to https://phabricator.wikimedia.org/P39614 and previous config saved to /var/cache/conftool/dbconfig/20221114-224224-marostegui.json [22:43:47] (PS1) JHathaway: aux-k8s: fix bgp_peers [puppet] - https://gerrit.wikimedia.org/r/856694 (https://phabricator.wikimedia.org/T321120) [22:44:13] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/856694 (https://phabricator.wikimedia.org/T321120) (owner: JHathaway) [22:44:22] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (Papaul) @jcrespo are we installing the OS on this on the hds? 
[22:45:58] (KubernetesRsyslogDown) firing: (2) rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:50:29] (CR) JHathaway: [C: +2] aux-k8s: fix bgp_peers [puppet] - https://gerrit.wikimedia.org/r/856694 (https://phabricator.wikimedia.org/T321120) (owner: JHathaway) [22:55:46] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. [22:56:25] !log jhathaway@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [22:56:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P39616 and previous config saved to /var/cache/conftool/dbconfig/20221114-225638-marostegui.json [22:57:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P39617 and previous config saved to /var/cache/conftool/dbconfig/20221114-225730-marostegui.json [23:00:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on aux-k8s-ctrl1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:00:58] (KubernetesRsyslogDown) resolved: (2) rsyslog on aux-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:04:00] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [23:04:58] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 47 probes of 785 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:05:15] (JobUnavailable) firing: Reduced availability for job k8s-api in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:05:34] PROBLEM - SSH on aux-k8s-ctrl1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:07:22] RECOVERY - SSH on aux-k8s-ctrl1001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:09:54] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [23:10:15] (JobUnavailable) resolved: (2) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:11:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110', diff saved to https://phabricator.wikimedia.org/P39618 and previous config saved to /var/cache/conftool/dbconfig/20221114-231146-marostegui.json [23:12:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P39619 and previous config saved to /var/cache/conftool/dbconfig/20221114-231238-marostegui.json [23:16:54] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 23 probes of 785 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:21:04] PROBLEM - SSH on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:21:42] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:22:06] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 89, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:22:45] 
(JobUnavailable) firing: Reduced availability for job k8s-api in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:22:58] (KubernetesAPILatency) firing: (3) High Kubernetes API latency (PUT leases) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:24:12] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:26:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2110 (T321130)', diff saved to https://phabricator.wikimedia.org/P39620 and previous config saved to /var/cache/conftool/dbconfig/20221114-232653-marostegui.json [23:26:55] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 5:00:00 on db2119.codfw.wmnet with reason: Maintenance [23:26:58] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [23:27:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on db2119.codfw.wmnet with reason: Maintenance [23:27:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2119 (T321130)', diff saved to https://phabricator.wikimedia.org/P39621 and previous config saved to /var/cache/conftool/dbconfig/20221114-232714-marostegui.json [23:27:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T321126)', diff saved to https://phabricator.wikimedia.org/P39622 and previous config saved to /var/cache/conftool/dbconfig/20221114-232744-marostegui.json [23:27:49] T321126: Add column 'cul_actor' and 
index cul_actor_time to cu_log on wmf wikis - https://phabricator.wikimedia.org/T321126 [23:27:58] (KubernetesAPILatency) resolved: (6) High Kubernetes API latency (LIST cronjobs) on aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=aux-k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:31:30] (CR) Hashar: "recheck" [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/824200 (https://phabricator.wikimedia.org/T307334) (owner: Hashar) [23:32:52] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbprov2004.codfw.wmnet with OS bullseye [23:33:00] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host dbprov2004.codfw.wmnet with OS bullseye executed with errors: - dbprov2004 (... 
[23:34:57] SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: Q1:rack/setup/install dbprov2004 - https://phabricator.wikimedia.org/T321128 (Papaul) [23:36:06] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host puppetdb2003.codfw.wmnet with OS bullseye [23:36:10] SRE, ops-codfw, DC-Ops: Q2:rack/setup/install puppetdb2003 - https://phabricator.wikimedia.org/T317894 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host puppetdb2003.codfw.wmnet with OS bullseye [23:36:48] RECOVERY - SSH on aux-k8s-ctrl1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:37:30] PROBLEM - SSH on mw1328.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:37:45] (JobUnavailable) firing: (3) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:37:59] SRE, ops-codfw, DC-Ops: Q2:rack/setup/install puppetdb2003 - https://phabricator.wikimedia.org/T317894 (Papaul) [23:39:12] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:39:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119 (T321130)', diff saved to https://phabricator.wikimedia.org/P39623 and previous config saved to /var/cache/conftool/dbconfig/20221114-233922-marostegui.json [23:39:27] T321130: Add column cuc_private to cu_changes on wmf wikis - https://phabricator.wikimedia.org/T321130 [23:39:43] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not 
running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:41:12] PROBLEM - SSH on aux-k8s-ctrl1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:42:45] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:42:54] PROBLEM - SSH on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:44:12] (KubernetesCalicoDown) firing: (2) aux-k8s-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [23:47:45] (JobUnavailable) firing: (5) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:48:22] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (Papaul) [23:50:44] RECOVERY - SSH on aux-k8s-ctrl1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:52:10] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host arclamp2001.codfw.wmnet with OS bullseye [23:52:15] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install arclamp2001.codfw.wmnet - https://phabricator.wikimedia.org/T319428 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host arclamp2001.codfw.wmnet with OS 
bullseye [23:52:45] (JobUnavailable) firing: (5) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:54:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2119', diff saved to https://phabricator.wikimedia.org/P39624 and previous config saved to /var/cache/conftool/dbconfig/20221114-235429-marostegui.json [23:55:31] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetdb2003.codfw.wmnet with reason: host reimage [23:56:50] PROBLEM - SSH on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [23:57:45] (JobUnavailable) firing: (5) Reduced availability for job calico-felix in aux-k8s@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:58:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetdb2003.codfw.wmnet with reason: host reimage