[00:03:54] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:02] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The following units failed: dispatch-ldap-users-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:46] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:00] (PS1) Bartosz Dziewoński: Add "Clear Affordances" to DiscussionTools beta feature on remaining wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/879158 (https://phabricator.wikimedia.org/T321955) [00:07:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:09:09] (PS1) Bartosz Dziewoński: Add "Page Frame" to DiscussionTools beta feature on partner wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/879159 (https://phabricator.wikimedia.org/T317907) [00:09:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P43048 and previous config saved to /var/cache/conftool/dbconfig/20230112-000929-marostegui.json [00:16:48] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [00:18:14] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [00:20:24] SRE-OnFire, SRE 
Observability (FY2022/2023-Q3): implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (lmata) [00:24:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T321391)', diff saved to https://phabricator.wikimedia.org/P43049 and previous config saved to /var/cache/conftool/dbconfig/20230112-002436-marostegui.json [00:24:38] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2174.codfw.wmnet with reason: Maintenance [00:24:41] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [00:24:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2174.codfw.wmnet with reason: Maintenance [00:24:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2174 (T321391)', diff saved to https://phabricator.wikimedia.org/P43050 and previous config saved to /var/cache/conftool/dbconfig/20230112-002457-marostegui.json [00:26:42] (PS1) Bartosz Dziewoński: Enable visual enhancements on all talk namespaces [extensions/DiscussionTools] (wmf/1.40.0-wmf.18) - https://gerrit.wikimedia.org/r/879103 (https://phabricator.wikimedia.org/T325417) [00:27:00] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [00:27:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T321391)', diff saved to https://phabricator.wikimedia.org/P43051 and previous config saved to /var/cache/conftool/dbconfig/20230112-002721-marostegui.json [00:27:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:33:44] PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:34:02] (PS1) Ebernhardson: cirrus: Divert requests with x-public-cloud set to a dedicated pool counter [mediawiki-config] - https://gerrit.wikimedia.org/r/879161 (https://phabricator.wikimedia.org/T326757) [00:35:20] RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:37:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:42:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P43052 and previous config saved to /var/cache/conftool/dbconfig/20230112-004228-marostegui.json [00:42:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:44:34] (CR) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [00:47:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:50:38] (CR) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [00:52:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:57:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P43053 and previous config saved to /var/cache/conftool/dbconfig/20230112-005734-marostegui.json [00:57:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [00:59:55] (CR) Cwhite: opensearch: make upgrade-phatality.sh stricter (6 comments) [puppet] - https://gerrit.wikimedia.org/r/849631 (https://phabricator.wikimedia.org/T304440) (owner: Hashar) 
[01:12:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T321391)', diff saved to https://phabricator.wikimedia.org/P43054 and previous config saved to /var/cache/conftool/dbconfig/20230112-011241-marostegui.json [01:12:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2176.codfw.wmnet with reason: Maintenance [01:12:45] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [01:12:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:12:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2176.codfw.wmnet with reason: Maintenance [01:13:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2176 (T321391)', diff saved to https://phabricator.wikimedia.org/P43055 and previous config saved to /var/cache/conftool/dbconfig/20230112-011302-marostegui.json [01:15:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T321391)', diff saved to https://phabricator.wikimedia.org/P43056 and previous config saved to /var/cache/conftool/dbconfig/20230112-011526-marostegui.json [01:17:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:27:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:28:02] PROBLEM - Check systemd state on rpki1001 is CRITICAL: CRITICAL - degraded: The following units failed: node-bgpalerter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:30:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P43057 and previous config saved to /var/cache/conftool/dbconfig/20230112-013033-marostegui.json [01:33:46] PROBLEM - Check systemd state on gitlab2002 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:46] (JobUnavailable) firing: (7) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:45:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P43058 and previous config saved to /var/cache/conftool/dbconfig/20230112-014539-marostegui.json [01:47:46] (JobUnavailable) firing: (7) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:57:12] (CR) Raymond Ndibe: Modify maintain-dbusers.py to call the 
rest-api service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [01:57:46] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:00:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T321391)', diff saved to https://phabricator.wikimedia.org/P43059 and previous config saved to /var/cache/conftool/dbconfig/20230112-020046-marostegui.json [02:00:50] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [02:02:46] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:15:15] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [02:16:23] Puppet, SRE, Data-Services, Infrastructure-Foundations, cloud-services-team (Kanban): clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (Andrew) >>! In T323324#8517754, @Dzahn wrote: > @Andrew Is it not maybe 65.19.157.35 ? Because that is... 
[02:17:46] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:22:46] (JobUnavailable) firing: (7) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:24:25] SRE, Observability-Alerting, WMF-Legal, WikimediaMessages, and 2 others: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (Slaporte) This looks good. Thank you! [02:27:46] (JobUnavailable) firing: (8) Reduced availability for job gitaly in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:47:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:50:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2103.codfw.wmnet with reason: Maintenance [02:51:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2103.codfw.wmnet with reason: Maintenance [02:51:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1099.eqiad.wmnet with reason: Maintenance [02:51:47] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1099.eqiad.wmnet with reason: 
Maintenance [02:51:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1099:3311 (T321391)', diff saved to https://phabricator.wikimedia.org/P43060 and previous config saved to /var/cache/conftool/dbconfig/20230112-025153-marostegui.json [02:51:57] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [02:52:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:54:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T321391)', diff saved to https://phabricator.wikimedia.org/P43061 and previous config saved to /var/cache/conftool/dbconfig/20230112-025417-marostegui.json [02:57:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:09:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P43062 and previous config saved to /var/cache/conftool/dbconfig/20230112-030924-marostegui.json [03:24:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P43063 and previous config saved to /var/cache/conftool/dbconfig/20230112-032430-marostegui.json [03:32:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:39:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T321391)', diff saved to https://phabricator.wikimedia.org/P43064 and previous config saved to /var/cache/conftool/dbconfig/20230112-033937-marostegui.json [03:39:39] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance [03:39:41] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [03:39:52] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1105.eqiad.wmnet with reason: Maintenance [03:39:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3311 (T321391)', diff saved to https://phabricator.wikimedia.org/P43065 and previous config saved to /var/cache/conftool/dbconfig/20230112-033958-marostegui.json [03:42:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T321391)', diff saved to https://phabricator.wikimedia.org/P43066 and previous config saved to /var/cache/conftool/dbconfig/20230112-034221-marostegui.json [03:52:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:57:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P43067 and previous config saved to /var/cache/conftool/dbconfig/20230112-035727-marostegui.json [04:02:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:07:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:12:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311', diff saved to https://phabricator.wikimedia.org/P43068 and previous config saved to /var/cache/conftool/dbconfig/20230112-041234-marostegui.json [04:12:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:22:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:27:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3311 (T321391)', diff saved to https://phabricator.wikimedia.org/P43069 and previous config saved to /var/cache/conftool/dbconfig/20230112-042741-marostegui.json [04:27:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1106.eqiad.wmnet with reason: Maintenance [04:27:45] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [04:27:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1106.eqiad.wmnet with reason: Maintenance [04:27:47] !log marostegui@cumin1001 START 
- Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [04:27:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [04:27:52] SRE, Parsoid, vm-requests: : VMs requested for - https://phabricator.wikimedia.org/T326775 (1313) [04:27:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1106 (T321391)', diff saved to https://phabricator.wikimedia.org/P43070 and previous config saved to /var/cache/conftool/dbconfig/20230112-042757-marostegui.json [04:30:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T321391)', diff saved to https://phabricator.wikimedia.org/P43071 and previous config saved to /var/cache/conftool/dbconfig/20230112-043020-marostegui.json [04:32:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:42:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:45:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P43072 and previous config saved to /var/cache/conftool/dbconfig/20230112-044526-marostegui.json [04:57:03] (PS32) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [04:59:18] (CR) CI reject: [V: -1] Modify 
maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [04:59:52] (CR) Raymond Ndibe: "Hello Bryan, I made changes to maintain_dbusers.py and maintain_dbusers.pp in attempt to address your review. Can you verify that what I d" [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [05:00:04] (PS33) Raymond Ndibe: Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) [05:00:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106', diff saved to https://phabricator.wikimedia.org/P43073 and previous config saved to /var/cache/conftool/dbconfig/20230112-050033-marostegui.json [05:02:18] (CR) CI reject: [V: -1] Modify maintain-dbusers.py to call the rest-api service [puppet] - https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: Raymond Ndibe) [05:15:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1106 (T321391)', diff saved to https://phabricator.wikimedia.org/P43074 and previous config saved to /var/cache/conftool/dbconfig/20230112-051539-marostegui.json [05:15:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1107.eqiad.wmnet with reason: Maintenance [05:15:44] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [05:15:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1107.eqiad.wmnet with reason: Maintenance [05:16:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1107 (T321391)', diff saved to https://phabricator.wikimedia.org/P43075 and previous config saved to 
/var/cache/conftool/dbconfig/20230112-051601-marostegui.json [05:18:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T321391)', diff saved to https://phabricator.wikimedia.org/P43076 and previous config saved to /var/cache/conftool/dbconfig/20230112-051823-marostegui.json [05:33:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P43077 and previous config saved to /var/cache/conftool/dbconfig/20230112-053330-marostegui.json [05:48:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107', diff saved to https://phabricator.wikimedia.org/P43078 and previous config saved to /var/cache/conftool/dbconfig/20230112-054837-marostegui.json [06:03:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1107 (T321391)', diff saved to https://phabricator.wikimedia.org/P43079 and previous config saved to /var/cache/conftool/dbconfig/20230112-060343-marostegui.json [06:03:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1119.eqiad.wmnet with reason: Maintenance [06:03:48] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [06:03:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1119.eqiad.wmnet with reason: Maintenance [06:04:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1119 (T321391)', diff saved to https://phabricator.wikimedia.org/P43080 and previous config saved to /var/cache/conftool/dbconfig/20230112-060404-marostegui.json [06:06:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T321391)', diff saved to https://phabricator.wikimedia.org/P43081 and previous config saved to /var/cache/conftool/dbconfig/20230112-060627-marostegui.json [06:15:15] (Wikidata 
Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [06:21:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P43082 and previous config saved to /var/cache/conftool/dbconfig/20230112-062134-marostegui.json [06:32:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:36:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119', diff saved to https://phabricator.wikimedia.org/P43083 and previous config saved to /var/cache/conftool/dbconfig/20230112-063640-marostegui.json [06:37:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:51:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1119 (T321391)', diff saved to https://phabricator.wikimedia.org/P43084 and previous config saved to /var/cache/conftool/dbconfig/20230112-065147-marostegui.json [06:51:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1128.eqiad.wmnet with reason: Maintenance [06:51:52] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [06:52:02] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1128.eqiad.wmnet 
with reason: Maintenance [06:52:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T321391)', diff saved to https://phabricator.wikimedia.org/P43085 and previous config saved to /var/cache/conftool/dbconfig/20230112-065208-marostegui.json [06:52:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:54:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T321391)', diff saved to https://phabricator.wikimedia.org/P43086 and previous config saved to /var/cache/conftool/dbconfig/20230112-065430-marostegui.json [06:57:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230112T0700) [07:00:04] kormat, marostegui, and Amir1: Dear deployers, time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230112T0700). 
[07:02:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:07:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:09:05] SRE, Performance-Team, Traffic, Performance Issue: en.wiki slow to respond when editing, and occasionally throws an error with Chrome search shortcuts, or blocked because missing HTTPS - https://phabricator.wikimedia.org/T326496 (larissagaulia) Open→Resolved a:larissagaulia [07:09:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P43087 and previous config saved to /var/cache/conftool/dbconfig/20230112-070936-marostegui.json [07:22:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:23:12] (PS1) Slyngshede: C:idm::deployment Add TLS termination. 
[puppet] - 10https://gerrit.wikimedia.org/r/879182 [07:24:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P43088 and previous config saved to /var/cache/conftool/dbconfig/20230112-072443-marostegui.json [07:26:45] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39101/console" [puppet] - 10https://gerrit.wikimedia.org/r/879182 (owner: 10Slyngshede) [07:28:28] (03PS2) 10Slyngshede: C:idm::deployment Add TLS termination. [puppet] - 10https://gerrit.wikimedia.org/r/879182 [07:29:27] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39102/console" [puppet] - 10https://gerrit.wikimedia.org/r/879182 (owner: 10Slyngshede) [07:35:32] (03CR) 10Slyngshede: "I still need to figure out how to do the CNAME in DNS." [puppet] - 10https://gerrit.wikimedia.org/r/879182 (owner: 10Slyngshede) [07:38:51] !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 112 [07:39:19] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 112 [07:39:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T321391)', diff saved to https://phabricator.wikimedia.org/P43089 and previous config saved to /var/cache/conftool/dbconfig/20230112-073949-marostegui.json [07:39:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1132.eqiad.wmnet with reason: Maintenance [07:39:53] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [07:40:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1132.eqiad.wmnet with reason: Maintenance [07:40:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1132 
(T321391)', diff saved to https://phabricator.wikimedia.org/P43090 and previous config saved to /var/cache/conftool/dbconfig/20230112-074010-marostegui.json [07:40:39] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 37002 [07:41:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 37002 [07:42:07] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 9584 [07:42:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T321391)', diff saved to https://phabricator.wikimedia.org/P43091 and previous config saved to /var/cache/conftool/dbconfig/20230112-074232-marostegui.json [07:43:27] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 9584 [07:50:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:52:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:53:58] PROBLEM - Check systemd state on rpki2002 is CRITICAL: CRITICAL - degraded: The following units failed: node-bgpalerter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:55:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:56:14] (03CR) 10Muehlenhoff: [C: 03+2] package_builder::pbuilder_hook: Manage the hook directory with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/878879 (owner: 10Muehlenhoff) [07:57:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P43092 and previous config saved to /var/cache/conftool/dbconfig/20230112-075739-marostegui.json [07:59:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host bast5003.wikimedia.org [07:59:48] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:00:04] Amir1, apergos, and jnuche: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230112T0800). [08:00:34] morning! there are no trainees signed up today and no patches scheduled for the window. if any self deployers want to get something done, now's the time. [08:02:15] (03PS1) 10Ayounsi: Depool esams for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/879268 (https://phabricator.wikimedia.org/T316532) [08:02:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:04:41] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) On ulsfo: > I had no issues in the lab going from 14.1X53-D54.1 (It was the only available in the lab) to 19.1. (the closest version available on t... 
[08:05:59] (03PS1) 10Ayounsi: Add untrusted/customer/parked prefixes to bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/879269 (https://phabricator.wikimedia.org/T230600) [08:06:50] (03PS2) 10Ayounsi: Add untrusted/customer/parked prefixes to bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/879269 (https://phabricator.wikimedia.org/T230600) [08:09:51] (03CR) 10Ayounsi: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/879269/39103/rpki1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/879269 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi) [08:12:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P43093 and previous config saved to /var/cache/conftool/dbconfig/20230112-081245-marostegui.json [08:16:41] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast5003.wikimedia.org - jmm@cumin2002" [08:17:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:17:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM bast5003.wikimedia.org - jmm@cumin2002" [08:17:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:17:47] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache bast5003.wikimedia.org on all recursors [08:17:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) bast5003.wikimedia.org on all recursors [08:22:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:27:49] (03CR) 10Jcrespo: [C: 03+2] "Deploying after getting legal's ok- we can later tune the keywords if needed." [puppet] - 10https://gerrit.wikimedia.org/r/878010 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [08:27:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T321391)', diff saved to https://phabricator.wikimedia.org/P43094 and previous config saved to /var/cache/conftool/dbconfig/20230112-082752-marostegui.json [08:27:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1134.eqiad.wmnet with reason: Maintenance [08:28:04] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [08:28:07] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1134.eqiad.wmnet with reason: Maintenance [08:28:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T321391)', diff saved to https://phabricator.wikimedia.org/P43095 and previous config saved to /var/cache/conftool/dbconfig/20230112-082813-marostegui.json [08:31:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T321391)', diff saved to https://phabricator.wikimedia.org/P43096 and previous config saved to /var/cache/conftool/dbconfig/20230112-083135-marostegui.json [08:34:25] (03PS1) 10Effie Mouzeli: Revert "P:memcached::memkeys: install memkeys only if on buster" [puppet] - 10https://gerrit.wikimedia.org/r/879303 [08:34:25] RECOVERY - Check systemd state on people2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:36:51] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.ganeti.makevm (exit_code=0) for new host bast5003.wikimedia.org [08:36:59] (03PS2) 10Effie Mouzeli: Revert "P:memcached::memkeys: install memkeys only if on buster" [puppet] - 10https://gerrit.wikimedia.org/r/879303 [08:40:43] (03PS1) 10Filippo Giunchedi: systemd: send ::syslog output to remote destination [puppet] - 10https://gerrit.wikimedia.org/r/879272 (https://phabricator.wikimedia.org/T325806) [08:41:13] (03CR) 10Effie Mouzeli: [C: 03+2] Revert "P:memcached::memkeys: install memkeys only if on buster" [puppet] - 10https://gerrit.wikimedia.org/r/879303 (owner: 10Effie Mouzeli) [08:41:27] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [08:42:37] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [08:42:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:42:51] (03CR) 10CI reject: [V: 04-1] systemd: send ::syslog output to remote destination [puppet] - 10https://gerrit.wikimedia.org/r/879272 (https://phabricator.wikimedia.org/T325806) (owner: 10Filippo Giunchedi) [08:42:54] (03PS1) 10Muehlenhoff: Add bast5003 [puppet] - 10https://gerrit.wikimedia.org/r/879273 (https://phabricator.wikimedia.org/T324974) [08:43:52] (03PS2) 10Filippo Giunchedi: systemd: send ::syslog output to remote destination [puppet] - 10https://gerrit.wikimedia.org/r/879272 (https://phabricator.wikimedia.org/T325806) [08:46:25] (03CR) 10Muehlenhoff: [C: 03+2] Add bast5003 [puppet] - 10https://gerrit.wikimedia.org/r/879273 (https://phabricator.wikimedia.org/T324974) (owner: 10Muehlenhoff) [08:46:42] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P43097 and previous config saved to /var/cache/conftool/dbconfig/20230112-084641-marostegui.json [08:48:39] (03CR) 10Ayounsi: [C: 03+2] Depool esams for network maintenance [dns] - 10https://gerrit.wikimedia.org/r/879268 (https://phabricator.wikimedia.org/T316532) (owner: 10Ayounsi) [08:48:45] (03CR) 10Filippo Giunchedi: "This is the list of users of systemd::syslog as of today, for reference:" [puppet] - 10https://gerrit.wikimedia.org/r/879272 (https://phabricator.wikimedia.org/T325806) (owner: 10Filippo Giunchedi) [08:49:09] !log deployed updated patch for T311337 [08:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ping1003.eqiad.wmnet [08:50:12] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [08:50:29] 10Puppet, 10SRE, 10Data-Services, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (10taavi) can you try running `ferm-status` with `--verbose`? 
[08:50:38] !log depool esams for network maintenance - T316532 [08:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:41] T316532: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 [08:52:24] (03PS1) 10Majavah: hieradata: add wmcs-roots to clouddumps servers [puppet] - 10https://gerrit.wikimedia.org/r/879274 [08:53:58] !log phedenskog@deploy1002 Started deploy [performance/navtiming@172cc22]: (no justification provided) [08:54:16] !log phedenskog@deploy1002 Finished deploy [performance/navtiming@172cc22]: (no justification provided) (duration: 00m 17s) [08:54:19] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ping1003.eqiad.wmnet - jmm@cumin2002" [08:54:54] !log phedenskog@deploy1002 Started deploy [performance/navtiming@172cc22]: (no justification provided) [08:55:17] !log phedenskog@deploy1002 Finished deploy [performance/navtiming@172cc22]: (no justification provided) (duration: 00m 22s) [08:55:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ping1003.eqiad.wmnet - jmm@cumin2002" [08:55:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:55:23] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ping1003.eqiad.wmnet on all recursors [08:55:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ping1003.eqiad.wmnet on all recursors [09:00:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ping1003.eqiad.wmnet [09:01:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P43098 and previous config saved to /var/cache/conftool/dbconfig/20230112-090148-marostegui.json [09:02:21] (03PS1) 10KartikMistry: 
testwiki: Use Parsoid in Mediawiki Core for Content Translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879276 (https://phabricator.wikimedia.org/T323667) [09:08:48] (03PS1) 10Daniel Kinzler: Remove obsolete MWMinimalScriptInit and MEDIAWIKI_MAINT_INIT_ONLY. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879277 [09:11:17] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1039.eqiad.wmnet [09:13:21] 10SRE, 10serviceops, 10User-Elukey: Test memsniff as possible replacement of memkeys - https://phabricator.wikimedia.org/T228970 (10jijiki) For the time being, we have packaged memkeys for bullseye so not to block T293216 [09:13:32] (03PS3) 10Thiemo Kreuz (WMDE): Deprecate the EnableMapFrame feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875463 (https://phabricator.wikimedia.org/T326288) (owner: 10Awight) [09:13:40] (03CR) 10CI reject: [V: 04-1] Deprecate the EnableMapFrame feature flag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875463 (https://phabricator.wikimedia.org/T326288) (owner: 10Awight) [09:14:04] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to datacenter-ops for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Jelto) >>! In T326649#8513274, @Papaul wrote: > @Jelto thanks for the reply i have already her SSH-key and I will personally be adding her to the group onc... 
[09:16:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T321391)', diff saved to https://phabricator.wikimedia.org/P43099 and previous config saved to /var/cache/conftool/dbconfig/20230112-091654-marostegui.json [09:16:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1135.eqiad.wmnet with reason: Maintenance [09:16:59] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [09:17:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1135.eqiad.wmnet with reason: Maintenance [09:17:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T321391)', diff saved to https://phabricator.wikimedia.org/P43100 and previous config saved to /var/cache/conftool/dbconfig/20230112-091716-marostegui.json [09:17:24] (03CR) 10Jelto: admin: Add Jennifer Hancock to the datacenter-ops group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878171 (https://phabricator.wikimedia.org/T326649) (owner: 10Papaul) [09:19:16] (03PS1) 10Ilias Sarantopoulos: ml-services: multi-processing changes for articlequality and drafttopic [deployment-charts] - 10https://gerrit.wikimedia.org/r/879279 (https://phabricator.wikimedia.org/T323624) [09:19:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T321391)', diff saved to https://phabricator.wikimedia.org/P43101 and previous config saved to /var/cache/conftool/dbconfig/20230112-091937-marostegui.json [09:20:23] (03PS2) 10Ilias Sarantopoulos: ml-services: multi-processing changes drafttopic (load-testing) [deployment-charts] - 10https://gerrit.wikimedia.org/r/879279 (https://phabricator.wikimedia.org/T323624) [09:21:39] RECOVERY - Check systemd state on gitlab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[09:22:36] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [09:24:54] !log jiji@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host mc1039.eqiad.wmnet [09:24:58] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1039.eqiad.wmnet [09:25:12] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host mc1039.eqiad.wmnet [09:26:05] (03PS1) 10Jcrespo: icinga: Add BeautifulSoap4 python dependency for check_legal [puppet] - 10https://gerrit.wikimedia.org/r/879280 (https://phabricator.wikimedia.org/T317169) [09:26:26] (03CR) 10CI reject: [V: 04-1] icinga: Add BeautifulSoap4 python dependency for check_legal [puppet] - 10https://gerrit.wikimedia.org/r/879280 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [09:27:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:28:06] (03CR) 10MVernon: [C: 03+2] hiera: move swift credentials into common [labs/private] - 10https://gerrit.wikimedia.org/r/868718 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [09:28:10] (03CR) 10MVernon: [V: 03+2 C: 03+2] hiera: move swift credentials into common [labs/private] - 10https://gerrit.wikimedia.org/r/868718 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [09:28:23] (03PS2) 10Jcrespo: icinga: Add BeautifulSoap4 python dependency for check_legal [puppet] - 10https://gerrit.wikimedia.org/r/879280 (https://phabricator.wikimedia.org/T317169) [09:28:59] (03CR) 10Klausman: [C: 03+2] ml-services: multi-processing changes drafttopic (load-testing) [deployment-charts] - 10https://gerrit.wikimedia.org/r/879279 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [09:29:49] (03CR) 10Jcrespo: "I know you 
are out, but I am rather confident about this patch, so feel free to post-merge review" [puppet] - 10https://gerrit.wikimedia.org/r/879280 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [09:29:57] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [09:31:03] (03PS3) 10Jcrespo: icinga: Add BeautifulSoap4 python dependency for check_legal [puppet] - 10https://gerrit.wikimedia.org/r/879280 (https://phabricator.wikimedia.org/T317169) [09:31:07] !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 36 hosts with reason: network maintenance [09:31:32] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 36 hosts with reason: network maintenance [09:31:36] 10SRE, 10observability: service implementation tracking: arclamp1001.eqiad.wmnet - https://phabricator.wikimedia.org/T319434 (10MoritzMuehlenhoff) p:05Triage→03Medium [09:31:36] (03PS4) 10Jcrespo: icinga: Add BeautifulSoup4 python dependency for check_legal [puppet] - 10https://gerrit.wikimedia.org/r/879280 (https://phabricator.wikimedia.org/T317169) [09:31:41] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=43642849-a893-44f6-961e-0bb82f3a9b4e) set by ayounsi@cumin1001 for 2:00:00 on 36 host(s) an... 
[09:31:43] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:32:46] (JobUnavailable) firing: (6) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:34:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P43102 and previous config saved to /var/cache/conftool/dbconfig/20230112-093443-marostegui.json [09:34:47] (03Merged) 10jenkins-bot: ml-services: multi-processing changes drafttopic (load-testing) [deployment-charts] - 10https://gerrit.wikimedia.org/r/879279 (https://phabricator.wikimedia.org/T323624) (owner: 10Ilias Sarantopoulos) [09:34:59] (03PS5) 10Jcrespo: icinga: Add BeautifulSoup4 python dependency for check_legal [puppet] - 10https://gerrit.wikimedia.org/r/879280 (https://phabricator.wikimedia.org/T317169) [09:35:22] 10SRE, 10ops-esams, 10DC-Ops, 10Infrastructure-Foundations, 10decommission-hardware: decommission atlas-esams - https://phabricator.wikimedia.org/T307026 (10ayounsi) That's a bit embarrassing, but the box came back from the dead... https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=ripe... [09:36:51] (03CR) 10MVernon: [C: 03+2] swift: move accounts_keys to common hiera global_account_keys [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [09:37:56] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . 
[09:38:57] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/output/879280/39105/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/879280 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [09:39:18] (03CR) 10Jcrespo: [C: 03+2] icinga: Add BeautifulSoup4 python dependency for check_legal [puppet] - 10https://gerrit.wikimedia.org/r/879280 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [09:39:36] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ping2003.codfw.wmnet [09:39:38] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:41:59] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ping2003.codfw.wmnet - jmm@cumin2002" [09:42:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ping2003.codfw.wmnet - jmm@cumin2002" [09:42:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:42:58] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ping2003.codfw.wmnet on all recursors [09:43:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ping2003.codfw.wmnet on all recursors [09:45:53] PROBLEM - puppet last run on gitlab2002 is CRITICAL: CRITICAL: Puppet last ran 8 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [09:46:25] RECOVERY - Ensure legal html en.wp on en.wikipedia.org is OK: All legal html excerpts are present for https://en.wikipedia.org/wiki/Main_Page (desktop site): copyright, terms, privacy, trademark https://phabricator.wikimedia.org/project/members/28/ [09:46:28] !log redirect ns2 to authdns1001 - T316532 [09:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:31] T316532: Upgrade POPs asw to Junos 21 - 
https://phabricator.wikimedia.org/T316532 [09:47:14] !log btullis@cumin1001 Added views for new wiki: pcmwiki T310879 [09:47:14] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [09:47:17] T310879: Prepare and check storage layer for pcmwiki - https://phabricator.wikimedia.org/T310879 [09:47:46] (JobUnavailable) firing: (7) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:47:59] ACKNOWLEDGEMENT - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service Effie Mouzeli It is ok as we are retiring some components https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:47:59] ACKNOWLEDGEMENT - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service Effie Mouzeli It is ok as we are retiring some components https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:48:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ping2003.codfw.wmnet [09:49:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P43103 and previous config saved to /var/cache/conftool/dbconfig/20230112-094950-marostegui.json [09:50:01] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host mwmaint2002.codfw.wmnet [09:50:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ping3003.esams.wmnet [09:50:39] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [09:53:38] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM 
ping3003.esams.wmnet - jmm@cumin2002" [09:54:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ping3003.esams.wmnet - jmm@cumin2002" [09:54:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:54:43] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ping3003.esams.wmnet on all recursors [09:54:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ping3003.esams.wmnet on all recursors [09:56:18] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mwmaint2002.codfw.wmnet [09:56:38] (03PS1) 10Alexandros Kosiaris: wikifunctions: Add AppArmor profile usage [deployment-charts] - 10https://gerrit.wikimedia.org/r/879282 (https://phabricator.wikimedia.org/T326785) [09:56:51] 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, and 2 others: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10jcrespo) 👍 {F36153930} [09:58:12] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [09:58:28] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/879269 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi) [09:59:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ping3003.esams.wmnet [10:00:38] (03PS1) 10MVernon: hiera: remove swift accounts_keys [labs/private] - 10https://gerrit.wikimedia.org/r/879283 (https://phabricator.wikimedia.org/T162123) [10:01:12] !log reboot asw2-esams for upgrade - T316532 [10:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:16] T316532: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 [10:01:54] (03CR) 10MVernon: "Now we've deployed the global swift credential hiera entry, remove the 
per-site ones to reduce confusion in future :-)" [labs/private] - 10https://gerrit.wikimedia.org/r/879283 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [10:04:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T321391)', diff saved to https://phabricator.wikimedia.org/P43104 and previous config saved to /var/cache/conftool/dbconfig/20230112-100456-marostegui.json [10:04:58] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:05:01] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [10:05:11] PROBLEM - VRRP status on cr3-esams is CRITICAL: VRRP CRITICAL - 3 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [10:05:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1139.eqiad.wmnet with reason: Maintenance [10:05:25] PROBLEM - BGP status on cr3-esams is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv4: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - Anycast, AS64600/IPv4: Connect - PyBal, AS64600/IPv4: Connect - PyBal, AS64605/IPv6: Connect - Anycast, AS64605/IPv6: Connect - [10:05:25] , AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:05:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance [10:05:31] expected ^ [10:05:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance [10:05:47] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: 
host 91.198.174.244, interfaces up: 76, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:05:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance [10:05:59] PROBLEM - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 57, down: 4, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:06:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1163.eqiad.wmnet with reason: Maintenance [10:06:11] PROBLEM - BFD status on cr3-esams is CRITICAL: CRIT: Down: 10 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:06:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T321391)', diff saved to https://phabricator.wikimedia.org/P43105 and previous config saved to /var/cache/conftool/dbconfig/20230112-100616-marostegui.json [10:06:21] PROBLEM - Router interfaces on cr3-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 70, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:06:37] RECOVERY - puppet last run on gitlab2002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:07:46] (JobUnavailable) firing: (7) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:08:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T321391)', diff saved to https://phabricator.wikimedia.org/P43106 and previous config saved to 
/var/cache/conftool/dbconfig/20230112-100839-marostegui.json [10:11:29] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_esams_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:27] RECOVERY - VRRP status on cr3-esams is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [10:12:43] PROBLEM - Host 2620:0:862:1:91:198:174:62 is DOWN: PING CRITICAL - Packet loss = 100% [10:12:43] PROBLEM - Host 2620:0:862:1:91:198:174:61 is DOWN: PING CRITICAL - Packet loss = 100% [10:12:46] (JobUnavailable) firing: (29) Reduced availability for job bird in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:12:49] (ProbeDown) firing: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:13:13] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 90, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:13:25] RECOVERY - Router interfaces on cr3-knams is OK: OK: host 91.198.174.246, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:13:41] RECOVERY - BFD status on cr3-esams is OK: OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:13:49] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:13:57] RECOVERY - Host 2620:0:862:1:91:198:174:61 is UP: PING OK - Packet loss = 0%, RTA = 81.02 ms [10:14:03] RECOVERY - Host 2620:0:862:1:91:198:174:62 is UP: PING OK - Packet loss = 0%, RTA = 81.04 ms [10:14:23] RECOVERY - BGP status on cr3-esams is OK: BGP OK - up: 20, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:15:15] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [10:15:26] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) [10:15:31] PROBLEM - Check systemd state on rpki2002 is CRITICAL: CRITICAL - degraded: The following units failed: node-bgpalerter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:51] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) 10min downtime, everything went smooth. 
[10:16:43] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, and 2 others: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10ayounsi) [10:17:46] (JobUnavailable) firing: (30) Reduced availability for job bird in ops@esams - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:17:49] (ProbeDown) resolved: (14) Service ncredir-https:443 has failed probes (http_ncredir-https_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:18:37] (03PS1) 10Muehlenhoff: Add ping[123]003 [puppet] - 10https://gerrit.wikimedia.org/r/879284 (https://phabricator.wikimedia.org/T273509) [10:19:17] PROBLEM - Check unit status of netbox_ganeti_esams_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:21:30] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/879272 (https://phabricator.wikimedia.org/T325806) (owner: 10Filippo Giunchedi) [10:23:13] (03PS2) 10Muehlenhoff: Add ping[123]003 [puppet] - 10https://gerrit.wikimedia.org/r/879284 (https://phabricator.wikimedia.org/T273509) [10:23:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P43107 and previous config saved to /var/cache/conftool/dbconfig/20230112-102345-marostegui.json [10:24:13] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:24:35] (03CR) 10Muehlenhoff: [C: 03+2] Add ping[123]003 [puppet] - 10https://gerrit.wikimedia.org/r/879284 (https://phabricator.wikimedia.org/T273509) (owner: 
10Muehlenhoff) [10:24:58] !log rollback redirect ns2 to authdns1001 - T316532 [10:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:02] T316532: Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 [10:25:30] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-presto1001.eqiad.wmnet [10:29:03] (03PS1) 10JMeybohm: admin_ng RBAC: If-guard additional permissions [deployment-charts] - 10https://gerrit.wikimedia.org/r/879285 [10:29:34] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-presto1001.eqiad.wmnet [10:29:49] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/879182 (owner: 10Slyngshede) [10:29:51] RECOVERY - Check unit status of netbox_ganeti_esams_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:31:40] (03PS1) 10Jelto: gitlab: start restore job later on replicas [puppet] - 10https://gerrit.wikimedia.org/r/879406 (https://phabricator.wikimedia.org/T326315) [10:34:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [10:38:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P43108 and previous config saved to /var/cache/conftool/dbconfig/20230112-103852-marostegui.json [10:38:56] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8932 [10:39:45] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8932 [10:39:54] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 8674 [10:39:55] (03CR) 10Phedenskog: "This change is ready for review." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875887 (https://phabricator.wikimedia.org/T321398) (owner: 10Phedenskog) [10:40:47] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 8674 [10:41:08] (03CR) 10Phedenskog: "We plan to test out the other metrics with fewer labels so we gonna wait with adding those now, lets try with the CPU benchmark first." [puppet] - 10https://gerrit.wikimedia.org/r/875887 (https://phabricator.wikimedia.org/T321398) (owner: 10Phedenskog) [10:41:31] !log hashar@deploy1002 Started deploy [integration/docroot@577d68a]: zuul: Link to report_url if available [10:41:46] !log hashar@deploy1002 Finished deploy [integration/docroot@577d68a]: zuul: Link to report_url if available (duration: 00m 14s) [10:42:21] 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, and 2 others: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10jcrespo) a:03jcrespo @Xaosflux So in the end, no change of procedure is needed fo... 
[10:44:07] (03PS32) 10Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [10:44:09] (03PS1) 10Jbond: base::cache: drop wikimediafoundation.org from wikimedia_domains [puppet] - 10https://gerrit.wikimedia.org/r/879409 (https://phabricator.wikimedia.org/T300977) [10:49:51] !log ayounsi@cumin1001 START - Cookbook sre.hosts.remove-downtime for 36 hosts [10:50:03] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 36 hosts [10:51:19] (03CR) 10FNegri: [C: 03+2] Make sure cloud_cumin public key is evaluated [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [10:53:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T321391)', diff saved to https://phabricator.wikimedia.org/P43109 and previous config saved to /var/cache/conftool/dbconfig/20230112-105358-marostegui.json [10:54:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance [10:54:02] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [10:54:24] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1169.eqiad.wmnet with reason: Maintenance [10:54:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T321391)', diff saved to https://phabricator.wikimedia.org/P43110 and previous config saved to /var/cache/conftool/dbconfig/20230112-105430-marostegui.json [10:54:32] (03PS12) 10FNegri: Make sure cloud_cumin public key is evaluated [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) [10:56:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T321391)', diff saved to 
https://phabricator.wikimedia.org/P43111 and previous config saved to /var/cache/conftool/dbconfig/20230112-105652-marostegui.json [10:57:54] (03CR) 10Volans: [C: 03+1] "just rebased" [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [10:58:29] (03CR) 10FNegri: [C: 03+2] Make sure cloud_cumin public key is evaluated [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [11:00:04] mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230112T1100). nyaa~ [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230112T1100) [11:00:36] (03CR) 10Ayounsi: [V: 03+1 C: 03+2] Add untrusted/customer/parked prefixes to bgpalerter [puppet] - 10https://gerrit.wikimedia.org/r/879269 (https://phabricator.wikimedia.org/T230600) (owner: 10Ayounsi) [11:04:11] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me!"
[puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [11:11:17] PROBLEM - Check systemd state on rpki2002 is CRITICAL: CRITICAL - degraded: The following units failed: node-bgpalerter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:22] !log mwscript extensions/GlobalBlocking/maintenance/FixBlockerUsername.php --wiki metawiki "Defender" "Elton" # T298707 [11:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:26] T298707: "InvalidArgumentException: Blocker must be a local user" from GlobalBlocking - https://phabricator.wikimedia.org/T298707 [11:11:55] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 3302 [11:11:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P43112 and previous config saved to /var/cache/conftool/dbconfig/20230112-111159-marostegui.json [11:12:55] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 3302 [11:13:00] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 3303 [11:14:20] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 3303 [11:14:51] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 25885 [11:15:05] (03PS1) 10Urbanecm: throttle: Add new rule for cswiki course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879412 (https://phabricator.wikimedia.org/T326792) [11:15:13] jouncebot: nowandnext [11:15:13] For the next 0 hour(s) and 44 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230112T1100) [11:15:13] For the next 0 hour(s) and 44 minute(s): MediaWiki infrastructure (UTC mid-day)
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230112T1100) [11:15:13] In 2 hour(s) and 44 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230112T1400) [11:15:13] In 2 hour(s) and 44 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230112T1400) [11:15:30] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 25885 [11:15:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879412 (https://phabricator.wikimedia.org/T326792) (owner: 10Urbanecm) [11:16:41] (03Merged) 10jenkins-bot: throttle: Add new rule for cswiki course [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879412 (https://phabricator.wikimedia.org/T326792) (owner: 10Urbanecm) [11:17:06] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:879412|throttle: Add new rule for cswiki course (T326792)]] [11:17:09] T326792: Request a throttle lift for a cswiki wiki course – 2023-01-12 - https://phabricator.wikimedia.org/T326792 [11:17:11] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:17:12] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (10Clement_Goubert) >>! In T265876#8512672, @Joe wrote: > We have now the logs in kafka, and thus should also be ingested in logstash, and create a dashboard. > > Once tha... 
[11:21:35] (03PS33) 10Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [11:24:37] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops: Ingest php-slowlog in logstash - https://phabricator.wikimedia.org/T326794 (10Clement_Goubert) [11:24:53] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:879412|throttle: Add new rule for cswiki course (T326792)]] (duration: 07m 47s) [11:24:57] T326792: Request a throttle lift for a cswiki wiki course – 2023-01-12 - https://phabricator.wikimedia.org/T326792 [11:25:02] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops: Ingest php-slowlog in logstash - https://phabricator.wikimedia.org/T326794 (10Clement_Goubert) 05Open→03In progress p:05Triage→03Medium [11:25:09] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops, 10Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (10Clement_Goubert) [11:26:02] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! I'm ok to merge this as-is if it looks good to you" [puppet] - 10https://gerrit.wikimedia.org/r/875887 (https://phabricator.wikimedia.org/T321398) (owner: 10Phedenskog) [11:26:54] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[labs/private] - 10https://gerrit.wikimedia.org/r/879283 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [11:27:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P43113 and previous config saved to /var/cache/conftool/dbconfig/20230112-112705-marostegui.json [11:27:11] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [11:29:28] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync [11:30:03] 10SRE, 10MW-on-K8s, 10SRE Observability, 10serviceops: Ingest php-slowlog in logstash - https://phabricator.wikimedia.org/T326794 (10Clement_Goubert) The retention of the kafka topic is currently the default 7 days. This will be reduced once logstash ingestion is setup. [11:34:56] (03PS11) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) [11:34:58] (03PS1) 10Jbond: P:environment: roll out no proxy config to all hosts [puppet] - 10https://gerrit.wikimedia.org/r/879418 (https://phabricator.wikimedia.org/T300977) [11:35:17] (03PS12) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) [11:36:14] (03CR) 10Jbond: environment: add no_proxy config directly to environment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [11:36:25] (03PS1) 10Majavah: admin: remove duplicate users from ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/879420 [11:36:27] (03PS1) 10Majavah: admin: add a test to prevent duplicates in users/ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/879421 [11:37:11] (03CR) 10CI reject: [V: 04-1] environment: add no_proxy config directly to environment [puppet] - 
10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [11:37:15] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [11:37:40] (03PS13) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) [11:39:24] (03PS2) 10KartikMistry: testwiki: Use Parsoid in Mediawiki Core for Content Translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879276 (https://phabricator.wikimedia.org/T323667) [11:39:26] (03CR) 10CI reject: [V: 04-1] environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [11:39:39] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [11:40:18] (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/879417 (https://phabricator.wikimedia.org/T326794) (owner: 10Clément Goubert) [11:40:37] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39106/console" [puppet] - 10https://gerrit.wikimedia.org/r/879417 (https://phabricator.wikimedia.org/T326794) (owner: 10Clément Goubert) [11:41:59] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: sync [11:42:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T321391)', diff saved to https://phabricator.wikimedia.org/P43114 and previous config saved to /var/cache/conftool/dbconfig/20230112-114212-marostegui.json [11:42:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1176.eqiad.wmnet with reason: Maintenance [11:42:16] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [11:42:27] !log 
marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1176.eqiad.wmnet with reason: Maintenance [11:42:44] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1184.eqiad.wmnet with reason: Maintenance [11:42:44] (03PS14) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) [11:42:57] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1184.eqiad.wmnet with reason: Maintenance [11:43:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1184 (T321391)', diff saved to https://phabricator.wikimedia.org/P43115 and previous config saved to /var/cache/conftool/dbconfig/20230112-114302-marostegui.json [11:45:25] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39110/console" [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [11:45:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1184 (T321391)', diff saved to https://phabricator.wikimedia.org/P43116 and previous config saved to /var/cache/conftool/dbconfig/20230112-114524-marostegui.json [11:45:31] (03PS15) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) [11:49:18] (03PS16) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) [11:50:45] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39112/console" [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [11:52:21] !log 
hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync [11:52:23] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:54:37] !log re-seating cr2-esams fpc0 linecard - T318783 [11:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:41] T318783: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 [13:31:22] (03CR) 10Jbond: [C: 04-1] "lgtm but a few minor minor issues" [puppet] - 10https://gerrit.wikimedia.org/r/879051 (https://phabricator.wikimedia.org/T311385) (owner: 10Ayounsi) [13:36:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P43131 and previous config saved to /var/cache/conftool/dbconfig/20230112-133636-marostegui.json [13:39:45] (03PS1) 10Slyngshede: CNAME for idm-test [dns] - 10https://gerrit.wikimedia.org/r/879522 [13:45:51] 10SRE, 10SRE-OnFire, 10Release-Engineering-Team, 10serviceops-collab, 10Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (10hashar) The issue we had was to compare the state of the repositories between the two deployment servers. One of them had som... 
[13:49:47] (03PS7) 10Acamicamacaraca: Allow administrators to revoke autopatroller rights on sh.WP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/871272 (https://phabricator.wikimedia.org/T325938) [13:50:45] PROBLEM - Host mc2040 is DOWN: PING CRITICAL - Packet loss = 100% [13:51:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P43132 and previous config saved to /var/cache/conftool/dbconfig/20230112-135143-marostegui.json [13:53:21] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade [13:58:06] 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, and 2 others: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10jcrespo) Documentation/runbook: https://wikitech.wikimedia.org/wiki/Check_legal_html [14:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230112T1400) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230112T1400). [14:00:05] Aca and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:15] Hey! I’d like to confirm that I’m present here regarding the deployment of patch 871272 (Allow administrators to revoke autopatroller rights on sh.WP). [14:00:43] o/ I can deploy today [14:00:46] hi [14:01:06] Aca: do you have the x-wikimedia-debug browser extension installed? [14:01:26] Yep. Should I open it and which server should I select? 
[14:01:34] (my backport has no obvious effect that can be tested, it only affects some logging) [14:01:53] (03CR) 10Majavah: [C: 03+2] "backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/879101 (owner: 10Bartosz Dziewoński) [14:02:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/871272 (https://phabricator.wikimedia.org/T325938) (owner: 10Acamicamacaraca) [14:03:01] (03Merged) 10jenkins-bot: Allow administrators to revoke autopatroller rights on sh.WP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/871272 (https://phabricator.wikimedia.org/T325938) (owner: 10Acamicamacaraca) [14:03:27] !log taavi@deploy1002 Started scap: Backport for [[gerrit:871272|Allow administrators to revoke autopatroller rights on sh.WP (T325938)]] [14:03:31] T325938: Change the configuration for revoking some rights on sh.WP - https://phabricator.wikimedia.org/T325938 [14:04:33] Aca: i'll let you know when your patch can be tested, but when it's available you can pick any of the mwdebug servers [14:04:42] (03CR) 10Jbond: [V: 03+1 C: 03+2] environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [14:05:15] !log taavi@deploy1002 taavi and aleksandar: Backport for [[gerrit:871272|Allow administrators to revoke autopatroller rights on sh.WP (T325938)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [14:05:29] Aca: your patch is now available for testing [14:05:54] Okie. 
I would also like to ask if I should select any of the additional options (XHGui, Verbose) [14:06:37] no, just select a server and set the enabled switch to 'ON' [14:06:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T321391)', diff saved to https://phabricator.wikimedia.org/P43133 and previous config saved to /var/cache/conftool/dbconfig/20230112-140649-marostegui.json [14:06:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1206.eqiad.wmnet with reason: Maintenance [14:06:53] Alright [14:06:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1206.eqiad.wmnet with reason: Maintenance [14:06:54] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [14:06:57] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1040.eqiad.wmnet with OS bullseye [14:07:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1206 (T321391)', diff saved to https://phabricator.wikimedia.org/P43134 and previous config saved to /var/cache/conftool/dbconfig/20230112-140659-marostegui.json [14:07:33] (03Merged) 10jenkins-bot: Track callers of parseRevisionParsoidHtml. [extensions/DiscussionTools] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/879101 (owner: 10Bartosz Dziewoński) [14:09:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T321391)', diff saved to https://phabricator.wikimedia.org/P43135 and previous config saved to /var/cache/conftool/dbconfig/20230112-140921-marostegui.json [14:09:28] (03PS1) 10Volans: cumin: set version during Debian build [software/cumin] - 10https://gerrit.wikimedia.org/r/879546 [14:10:22] Aca: hey, is it working? do you need help with anything?
[14:10:37] RECOVERY - BGP status on cr3-ulsfo is OK: BGP OK - up: 89, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:10:43] Everything looks correct. I also checked the special page User group rights, and everything seems to have been updated accordingly. [14:10:56] (03PS17) 10Jbond: environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) [14:10:59] great! pushing the changes to all the servers [14:11:13] you can turn off the x-wikimedia-debug extension now, if you didn't already [14:11:31] Done [14:12:46] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39115/console" [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [14:12:53] (03CR) 10Jbond: [C: 03+2] environment: add no_proxy config directly to environment (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [14:13:01] (03CR) 10Jbond: [V: 03+1 C: 03+2] environment: add no_proxy config directly to environment [puppet] - 10https://gerrit.wikimedia.org/r/878884 (https://phabricator.wikimedia.org/T278315) (owner: 10Jbond) [14:15:15] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [14:16:58] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:871272|Allow administrators to revoke autopatroller rights on sh.WP (T325938)]] (duration: 13m 30s) [14:17:02] T325938: Change the configuration for revoking some rights on sh.WP - https://phabricator.wikimedia.org/T325938 [14:17:13] Aca: your patch is now live [14:17:18] MatmaRex: yours is up next [14:17:31] 
Awesome. Thank you! [14:17:34] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [14:17:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:55] MatmaRex: do you still want to test the patch on a debug server or should I sync directly [14:17:56] yw [14:18:12] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1040.eqiad.wmnet with reason: host reimage [14:18:26] taavi: up to you [14:18:40] taavi: i can verify that it doesn't break normal functionality, i guess [14:18:49] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:18:52] i would prefer that if it's not too hard [14:18:54] !log taavi@deploy1002 Started scap: Backport for [[gerrit:879101|Track callers of parseRevisionParsoidHtml.]] [14:19:13] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:19:27] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:20:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1001.wikimedia.org [14:20:39] !log taavi@deploy1002 taavi and matmarex: Backport for [[gerrit:879101|Track callers of parseRevisionParsoidHtml.]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [14:20:40] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1040.eqiad.wmnet with reason: host reimage [14:20:52] MatmaRex: pulled to the test servers [14:21:06] looking [14:22:24] 
taavi: seems good [14:22:25] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.699 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:22:31] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 20 Feb 2023 05:31:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:22:39] thanks! syncing [14:23:27] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49420 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:24:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1001.wikimedia.org [14:24:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P43136 and previous config saved to /var/cache/conftool/dbconfig/20230112-142428-marostegui.json [14:26:18] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) [14:28:28] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:879101|Track callers of parseRevisionParsoidHtml.]] (duration: 09m 34s) [14:28:41] MatmaRex: done! [14:28:47] anyone have anything else to deploy? 
[14:28:49] thanks taavi [14:33:59] !log UTC afternoon backports done [14:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:55] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1040.eqiad.wmnet with OS bullseye [14:37:20] !log installing sqlite3 security updates on buster [14:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-fe1002.eqiad.wmnet [14:39:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P43137 and previous config saved to /var/cache/conftool/dbconfig/20230112-143934-marostegui.json [14:42:22] !log btullis@cumin1001 Added views for new wiki: guwwikiquote T321288 [14:42:22] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [14:42:25] T321288: Prepare and check storage layer for guwwikiquote - https://phabricator.wikimedia.org/T321288 [14:44:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-fe1002.eqiad.wmnet [14:49:51] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 467, down: 3, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:50:21] !log installing postgresql-11 security updates on puppetdb1002 [14:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-fe2002.codfw.wmnet [14:53:31] (03PS1) 10Effie Mouzeli: hieradata: disable maps tile_generation timers for planet import [puppet] - 10https://gerrit.wikimedia.org/r/879556 (https://phabricator.wikimedia.org/T314472) [14:54:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T321391)', diff saved to https://phabricator.wikimedia.org/P43138 and previous config 
saved to /var/cache/conftool/dbconfig/20230112-145441-marostegui.json [14:54:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [14:54:45] T321391: Add new column cu_log.cul_reason_id and cu_log.cul_reason_plaintext_id to wmf wikis - https://phabricator.wikimedia.org/T321391 [14:54:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [14:56:28] (03CR) 10Jgiannelos: [C: 03+1] hieradata: disable maps tile_generation timers for planet import [puppet] - 10https://gerrit.wikimedia.org/r/879556 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [14:58:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-fe2002.codfw.wmnet [15:01:55] (03CR) 10Effie Mouzeli: ipsec: remove ipsec role and the strongswan module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875897 (owner: 10Effie Mouzeli) [15:02:14] (03CR) 10Filippo Giunchedi: [C: 03+1] "Agreed! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/879520 (https://phabricator.wikimedia.org/T299125) (owner: 10MVernon) [15:02:16] (03CR) 10Effie Mouzeli: [C: 03+2] hieradata: disable maps tile_generation timers for planet import [puppet] - 10https://gerrit.wikimedia.org/r/879556 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [15:04:16] (03CR) 10Ssingh: [C: 03+1] "LGTM! Let's coordinate the deployment when we merge this." 
[puppet] - 10https://gerrit.wikimedia.org/r/878201 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall) [15:05:15] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1041.eqiad.wmnet [15:05:50] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/875897 (owner: 10Effie Mouzeli) [15:06:14] (03CR) 10Effie Mouzeli: [C: 03+2] ipsec: remove ipsec role and the strongswan module [puppet] - 10https://gerrit.wikimedia.org/r/875897 (owner: 10Effie Mouzeli) [15:06:22] (03PS4) 10Effie Mouzeli: ipsec: remove ipsec role and the strongswan module [puppet] - 10https://gerrit.wikimedia.org/r/875897 [15:06:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host dborch1001.wikimedia.org [15:06:50] (03PS1) 10Giuseppe Lavagetto: Update README [deployment-charts] - 10https://gerrit.wikimedia.org/r/879557 [15:10:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dborch1001.wikimedia.org [15:11:14] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [15:11:58] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1041.eqiad.wmnet [15:15:30] 10SRE, 10Traffic: Remove IPSec/Strongswan from Puppet repository - https://phabricator.wikimedia.org/T326745 (10BBlack) https://gerrit.wikimedia.org/r/c/operations/puppet/+/875897/ ! (apparently someone was already working on this!) [15:15:44] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/879522 (owner: 10Slyngshede) [15:18:55] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [software/cumin] - 10https://gerrit.wikimedia.org/r/879546 (owner: 10Volans) [15:18:58] (03CR) 10Svantje Lilienthal: "This change is ready for review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/879559 (https://phabricator.wikimedia.org/T326317) (owner: 10Svantje Lilienthal) [15:20:56] 10Puppet, 10SRE, 10Data-Services, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (10Andrew) >>! In T323324#8518687, @taavi wrote: > can you try running `ferm-status` with `--verbose`? Yea... [15:24:19] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:25:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49420 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:27:59] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [15:28:47] !log Planet import in codfw (on maps2009) started at 15:26 UTC - T314472 [15:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:51] T314472: Re-import full planet data into eqiad and codfw - https://phabricator.wikimedia.org/T314472 [15:29:33] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [15:29:50] (03PS1) 10Stang: etwikiquote: Switch logo variant back [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879561 (https://phabricator.wikimedia.org/T313698) [15:31:42] (03CR) 10Jbond: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/879522 (owner: 10Slyngshede) [15:34:31] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1118.eqiad.wmnet with reason: Maintenance [15:34:43] (03CR) 10Jbond: [C: 03+1] cumin: set version during Debian build [software/cumin] - 10https://gerrit.wikimedia.org/r/879546 (owner: 10Volans) [15:34:44] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1118.eqiad.wmnet with reason: Maintenance [15:35:07] (03CR) 10Volans: [C: 03+2] cumin: set version during Debian build [software/cumin] - 10https://gerrit.wikimedia.org/r/879546 (owner: 10Volans) [15:35:48] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to datacenter-ops for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (10Papaul) [15:35:50] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=kubernetes1014.eqiad.wmnet [15:36:06] !log btullis@cumin1001 Added views for new wiki: shnwikibooks T321256 [15:36:06] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [15:36:09] T321256: Prepare and check storage layer for shnwikibooks - https://phabricator.wikimedia.org/T321256 [15:38:49] (03CR) 10Hokwelum: "Hello, looks like bd808 and raymond-ndibe are still listed as wmcs-roots but are no longer on the team. 
Perhaps, they could be taken off t" [puppet] - 10https://gerrit.wikimedia.org/r/879274 (owner: 10Majavah) [15:41:05] (03PS2) 10Jbond: P:environment: roll out no proxy config to all hosts [puppet] - 10https://gerrit.wikimedia.org/r/879418 (https://phabricator.wikimedia.org/T300977) [15:42:01] (03Merged) 10jenkins-bot: cumin: set version during Debian build [software/cumin] - 10https://gerrit.wikimedia.org/r/879546 (owner: 10Volans) [15:42:21] (03CR) 10CI reject: [V: 04-1] P:environment: roll out no proxy config to all hosts [puppet] - 10https://gerrit.wikimedia.org/r/879418 (https://phabricator.wikimedia.org/T300977) (owner: 10Jbond) [15:44:08] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [15:46:12] (03CR) 10Herron: [C: 03+1] systemd: send ::syslog output to remote destination [puppet] - 10https://gerrit.wikimedia.org/r/879272 (https://phabricator.wikimedia.org/T325806) (owner: 10Filippo Giunchedi) [15:46:24] !log hnowlan@puppetmaster1001 conftool action : set/weight=8; selector: service=thumbor,name=kubernetes1014.eqiad.wmnet [15:47:47] !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes1014.eqiad.wmnet [15:52:54] (03PS1) 10Marostegui: install_server: Adjust new mariadb hosts [puppet] - 10https://gerrit.wikimedia.org/r/879563 [15:53:42] (03CR) 10Marostegui: [C: 03+2] install_server: Adjust new mariadb hosts [puppet] - 10https://gerrit.wikimedia.org/r/879563 (owner: 10Marostegui) [16:03:57] (03PS1) 10Jbond: ssh: add new match_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/879586 (https://phabricator.wikimedia.org/T323484) [16:04:28] (03CR) 10CI reject: [V: 04-1] ssh: add new match_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/879586 (https://phabricator.wikimedia.org/T323484) (owner: 10Jbond) [16:05:02] (03PS2) 10Jbond: ssh: add new match_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/879586 (https://phabricator.wikimedia.org/T323484) [16:05:33] 
(03CR) 10CI reject: [V: 04-1] ssh: add new match_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/879586 (https://phabricator.wikimedia.org/T323484) (owner: 10Jbond) [16:06:03] (03PS3) 10Jbond: ssh: add new match_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/879586 (https://phabricator.wikimedia.org/T323484) [16:08:51] !log btullis@cumin1001 Added views for new wiki: bjnwiktionary T312214 [16:08:51] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [16:08:57] T312214: Prepare and check storage layer for bjnwiktionary - https://phabricator.wikimedia.org/T312214 [16:09:59] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39116/console" [puppet] - 10https://gerrit.wikimedia.org/r/879586 (https://phabricator.wikimedia.org/T323484) (owner: 10Jbond) [16:10:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:10:40] (03PS3) 10Jbond: P:environment: roll out no proxy config to all hosts [puppet] - 10https://gerrit.wikimedia.org/r/879418 (https://phabricator.wikimedia.org/T300977) [16:13:12] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/879586 (https://phabricator.wikimedia.org/T323484) (owner: 10Jbond) [16:14:01] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [16:14:04] (03PS4) 10Jbond: ssh: add new match_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/879586 (https://phabricator.wikimedia.org/T323484) [16:14:13] (03CR) 10Jbond: [C: 03+2] ssh: add new match_config parameter [puppet] - 10https://gerrit.wikimedia.org/r/879586 (https://phabricator.wikimedia.org/T323484) (owner: 10Jbond) 
[16:14:15] (03PS2) 10Vlad.shapik: WIP: Update Thumbor repository according to the latest changes [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/876229 (https://phabricator.wikimedia.org/T325811) [16:14:43] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Enable kartographer external data parse time fetch for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879559 (https://phabricator.wikimedia.org/T326317) (owner: 10Svantje Lilienthal) [16:14:45] (03PS3) 10Vlad.shapik: WIP: Update Thumbor repository according to the latest changes [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/876229 (https://phabricator.wikimedia.org/T325811) [16:15:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:15:06] (03PS4) 10Vlad.shapik: WIP: Update Thumbor repository according to the latest changes [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/876229 (https://phabricator.wikimedia.org/T325811) [16:17:05] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [16:18:34] (03CR) 10CI reject: [V: 04-1] WIP: Update Thumbor repository according to the latest changes [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/876229 (https://phabricator.wikimedia.org/T325811) (owner: 10Vlad.shapik) [16:18:36] (03PS1) 10Zabe: Stop writing to cul_user and cul_user_text on a few 
wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879590 (https://phabricator.wikimedia.org/T233004) [16:18:41] (03PS1) 10Zabe: Start writing to rev_comment_id on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879591 (https://phabricator.wikimedia.org/T299954) [16:19:42] jouncebot, nowandnext [16:19:42] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [16:19:42] In 0 hour(s) and 40 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230112T1700) [16:20:11] (03CR) 10Zabe: [C: 03+2] Stop writing to cul_user and cul_user_text on a few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879590 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [16:20:17] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [16:20:24] (03CR) 10Zabe: [C: 03+2] Start writing to rev_comment_id on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879591 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [16:21:03] (03Merged) 10jenkins-bot: Stop writing to cul_user and cul_user_text on a few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879590 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [16:21:12] (03Merged) 10jenkins-bot: Start writing to rev_comment_id on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879591 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [16:21:46] !log zabe@deploy1002 Started scap: Backport for [[gerrit:879590|Stop writing to cul_user and cul_user_text on a few wikis (T233004)]], 
[[gerrit:879591|Start writing to rev_comment_id on group1 wikis (T299954)]] [16:21:51] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [16:21:51] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [16:23:31] !log zabe@deploy1002 zabe and zabe: Backport for [[gerrit:879590|Stop writing to cul_user and cul_user_text on a few wikis (T233004)]], [[gerrit:879591|Start writing to rev_comment_id on group1 wikis (T299954)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [16:24:39] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:17] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [16:31:35] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:879590|Stop writing to cul_user and cul_user_text on a few wikis (T233004)]], [[gerrit:879591|Start writing to rev_comment_id on group1 wikis (T299954)]] (duration: 09m 49s) [16:31:40] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [16:31:41] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [16:34:05] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [16:36:10] (03CR) 10DCausse: [C: 03+1] cirrus: Divert requests with x-public-cloud set to a dedicated pool counter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879161 (https://phabricator.wikimedia.org/T326757) (owner: 10Ebernhardson) [16:37:29] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:09] (03CR) 10Dzahn: "ACK, thank you Alex! I will do so to clean up." 
[puppet] - 10https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [16:41:17] (03Abandoned) 10Dzahn: thumbor/mwmaint: add periodic job to pull fc-list file [puppet] - 10https://gerrit.wikimedia.org/r/685914 (https://phabricator.wikimedia.org/T280718) (owner: 10Dzahn) [16:43:58] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [16:46:39] (HelmReleaseBadStatus) firing: Helm release thumbor/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:47:46] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [16:48:07] neat alert! that was my doing, fixing [16:48:33] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync [16:48:36] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync [16:49:39] (03PS1) 10Ryan Kemper: [WIP] wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064) [16:51:39] (HelmReleaseBadStatus) resolved: Helm release thumbor/main on k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=thumbor - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:54:15] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic: ulsfo refresh scheduling - https://phabricator.wikimedia.org/T317249 (10RobH) 05Open→03Resolved [16:54:19] (03PS1) 10Stang: nlwiki: Add block right to checkuser group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879600 (https://phabricator.wikimedia.org/T326355) [16:54:33] 10SRE, 10ops-ulsfo, 10decommission-hardware: ulsfo unified decom 
task - https://phabricator.wikimedia.org/T321596 (10RobH) 05Open→03Resolved [16:54:38] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10RobH) [16:54:52] 10SRE, 10ops-ulsfo, 10DC-Ops: ulsfo next visit checklist - https://phabricator.wikimedia.org/T322861 (10RobH) 05Open→03Resolved [16:55:13] 10SRE, 10ops-ulsfo, 10Traffic, 10decommission-hardware: decommission dns4001 - https://phabricator.wikimedia.org/T319215 (10RobH) 05Open→03Resolved this host is now gone [16:57:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bullseye [16:57:32] 10SRE, 10ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye [16:58:24] (03PS5) 10Vlad.shapik: WIP: Update Thumbor repository according to the latest changes [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/876229 (https://phabricator.wikimedia.org/T325811) [16:59:10] (03PS6) 10Vlad.shapik: WIP: Update Thumbor repository according to the latest changes [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/876229 (https://phabricator.wikimedia.org/T325811) [16:59:17] (03PS1) 10Jcrespo: icinga:Update legal check to link to wikitech and add legal contact [puppet] - 10https://gerrit.wikimedia.org/r/879601 (https://phabricator.wikimedia.org/T317169) [17:00:05] jbond and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230112T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:01:28] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you! 
LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/879601 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [17:03:16] (03CR) 10Dzahn: "thanks for your work on this, Jaime" [puppet] - 10https://gerrit.wikimedia.org/r/879601 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [17:03:53] (03PS1) 10Jbond: ssh: update match_config data structure [puppet] - 10https://gerrit.wikimedia.org/r/879602 [17:04:38] (03CR) 10Dzahn: [C: 03+1] icinga:Update legal check to link to wikitech and add legal contact [puppet] - 10https://gerrit.wikimedia.org/r/879601 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [17:04:44] (03PS1) 10Ebernhardson: looksLikeAutomation: Allow flagging requests from arbitrary headers [extensions/CirrusSearch] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/879571 (https://phabricator.wikimedia.org/T326757) [17:05:19] (03CR) 10CI reject: [V: 04-1] ssh: update match_config data structure [puppet] - 10https://gerrit.wikimedia.org/r/879602 (owner: 10Jbond) [17:05:39] (03CR) 10Jcrespo: [C: 03+2] icinga:Update legal check to link to wikitech and add legal contact [puppet] - 10https://gerrit.wikimedia.org/r/879601 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [17:06:39] (03PS2) 10Jbond: ssh: update match_config data structure [puppet] - 10https://gerrit.wikimedia.org/r/879602 [17:08:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39118/console" [puppet] - 10https://gerrit.wikimedia.org/r/879602 (owner: 10Jbond) [17:08:35] (03CR) 10Dzahn: [C: 03+1] "after changes to contacts/contacgroups it's usually a good idea to run a "sudo icinga -v /etc/icinga/icinga.cfg" on the server (alert1001)" [puppet] - 10https://gerrit.wikimedia.org/r/879601 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [17:08:38] !log btullis@cumin1001 Added views for new wiki: aswikiquote T321294 [17:08:38] !log btullis@cumin1001 END (PASS) - 
Cookbook sre.wikireplicas.add-wiki (exit_code=0) [17:08:41] T321294: Prepare and check storage layer for aswikiquote - https://phabricator.wikimedia.org/T321294 [17:09:20] (03CR) 10Dzahn: [C: 03+1] icinga:Update legal check to link to wikitech and add legal contact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879601 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [17:09:57] (03PS1) 10Jbond: ssh::server: add validate_cmd to sshd_config [puppet] - 10https://gerrit.wikimedia.org/r/879605 [17:12:40] (03CR) 10Jcrespo: [C: 03+2] icinga:Update legal check to link to wikitech and add legal contact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879601 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [17:13:29] (03CR) 10Jcrespo: [C: 03+2] "Looks ok :-)" [puppet] - 10https://gerrit.wikimedia.org/r/879601 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [17:14:52] (03CR) 10Jcrespo: [C: 03+2] icinga:Update legal check to link to wikitech and add legal contact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879601 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [17:16:12] (03PS1) 10Ryan Kemper: [WIP] wdqs: use pre-computed wdqs recording rules [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/879606 (https://phabricator.wikimedia.org/T323064) [17:16:21] (03PS2) 10Jbond: ssh::server: add validate_cmd to sshd_config [puppet] - 10https://gerrit.wikimedia.org/r/879605 [17:17:10] (03PS2) 10Ryan Kemper: [WIP] wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064) [17:17:34] 10SRE, 10Observability-Alerting, 10WMF-Legal, 10WikimediaMessages, and 2 others: Find the right procedure to update wiki footers (was en.wikibooks.org has changed legal footer) - https://phabricator.wikimedia.org/T317169 (10jcrespo) 05Open→03Resolved So the updated alarm has been deployed. Now the tick... 
[17:18:29] (03CR) 10Dzahn: [C: 03+1] icinga:Update legal check to link to wikitech and add legal contact (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879601 (https://phabricator.wikimedia.org/T317169) (owner: 10Jcrespo) [17:22:16] (03CR) 10Dzahn: [C: 03+1] gitlab: start restore job later on replicas [puppet] - 10https://gerrit.wikimedia.org/r/879406 (https://phabricator.wikimedia.org/T326315) (owner: 10Jelto) [17:32:10] (03CR) 10Eevans: [C: 03+1] swift: disable swifrepl timer job [puppet] - 10https://gerrit.wikimedia.org/r/879520 (https://phabricator.wikimedia.org/T299125) (owner: 10MVernon) [17:32:19] (03PS1) 10Volans: CHANGELOG: add changelogs for release v4.2.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/879612 [17:42:17] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v4.2.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/879612 (owner: 10Volans) [17:42:51] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10RobH) [17:44:25] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q1:rack/setup/install cp40[37-51] - https://phabricator.wikimedia.org/T317244 (10ssingh) [17:45:21] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: add HBA355i support to installer - https://phabricator.wikimedia.org/T319067 (10ssingh) 05Open→03Resolved I think we can close this task and mark it as resolved. The original purpose for which this was required has now been met and fut... 
[17:45:27] !log powercycling mc2040 via mgmt console [17:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:43] RECOVERY - Host mc2040 is UP: PING OK - Packet loss = 0%, RTA = 31.61 ms [17:49:09] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v4.2.0 [software/cumin] - 10https://gerrit.wikimedia.org/r/879612 (owner: 10Volans) [17:54:26] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2002.codfw.wmnet with OS bullseye [17:54:31] 10SRE, 10ops-codfw: codfw:test new Supermicro server - https://phabricator.wikimedia.org/T322578 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host sretest2002.codfw.wmnet with OS bullseye executed with errors: - sretest2002 (**FAIL**) - Downtimed on Icinga/Alert... [17:58:53] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:00:04] bd808: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Technical Engagement weekly deploy (Toolhub, Developer portal, Striker) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230112T1800). 
[18:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230112T1800) [18:03:17] 10SRE, 10serviceops: Memcached, mcrouter in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711 (10jijiki) a:05Joe→03jijiki [18:03:24] 10ops-eqiad, 10DC-Ops: hw troubleshooting: DIMM_B2 for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T326834 (10jijiki) [18:07:27] (03PS1) 10Ottomata: flink-kubernetes-operator - allow flink-app pods to talk to k8s API [deployment-charts] - 10https://gerrit.wikimedia.org/r/879618 (https://phabricator.wikimedia.org/T324576) [18:08:10] 10ops-eqiad, 10DC-Ops: hw troubleshooting: DIMM_B2 for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T326834 (10Papaul) @jijiki mc2040 that is codfw not eqiad [18:08:20] 10ops-codfw, 10DC-Ops: hw troubleshooting: DIMM_B2 for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T326834 (10Dzahn) [18:08:50] 10ops-codfw, 10DC-Ops: hw troubleshooting: DIMM_B2 for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T326834 (10Clement_Goubert) From what I understand, you can work on it any time, and we don't need to depool it. We may want to downtime it before y'all work on it. [18:15:15] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [18:16:05] 10ops-codfw, 10DC-Ops: hw troubleshooting: DIMM_B2 for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T326834 (10Jhancock.wm) @Clement_Goubert can you downtime the server? Please let me know when I can work on the server. 
[18:17:11] (03PS1) 10Volans: Upstream release v4.2.0 [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/879620 [18:17:46] (JobUnavailable) firing: (2) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:18:44] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mc2040.codfw.wmnet with reason: hardware troubleshooting [18:19:09] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mc2040.codfw.wmnet with reason: hardware troubleshooting [18:19:14] 10ops-codfw, 10DC-Ops: hw troubleshooting: DIMM_B2 for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T326834 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4016a17a-817d-4d48-be1d-b36713ff2632) set by cgoubert@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services with reaso... [18:19:29] 10ops-codfw, 10DC-Ops: hw troubleshooting: DIMM_B2 for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T326834 (10Clement_Goubert) @Jhancock.wm Done. [18:35:15] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:35:46] !log stat1007 - systemctl reset-failed - clears Icinga alerts [18:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:57] (03CR) 10Herron: "nice! 
please see a few comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [18:36:40] !log stat1008 - systemctl reset-failed - clears Icinga alerts from failed things of the past [18:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:15] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:38:32] 10SRE, 10ops-codfw, 10DC-Ops: Decommission mc20[19-27] and mc20[29-37] - https://phabricator.wikimedia.org/T313733 (10jijiki) a:05Jclark-ctr→03None [18:40:13] 10SRE, 10ops-codfw, 10DC-Ops: hw troubleshooting: DIMM_B2 for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T326834 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm @Clement_Goubert Thank you. We powered down and swapped the A2 and B2 DIMM to see if the error carries over. as of right now w... [18:40:40] 10SRE, 10ops-codfw, 10DC-Ops: Decommission mc20[19-27] and mc20[29-37] - https://phabricator.wikimedia.org/T313733 (10Papaul) a:03Jhancock.wm [18:41:36] 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10RobH) [18:41:51] 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10RobH) [18:43:00] 10SRE-swift-storage, 10serviceops: serviceops implementation tracking for ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326847 (10RobH) [18:44:34] 10SRE, 10ops-codfw, 10DC-Ops: Decommission mc20[19-27] and mc20[29-37] - https://phabricator.wikimedia.org/T313733 (10Papaul) Hello can someone please confirm that those servers are ready for decom since they are all active in Netbox. 
Thanks [18:45:41] 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10RobH) [18:46:17] 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10RobH) [18:47:17] 10SRE-swift-storage, 10serviceops: serviceops implementation tracking for ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326849 (10RobH) [19:00:05] jeena and dduvall: Time to snap out of that daydream and deploy MediaWiki train - Utc-7 Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230112T1900). [19:02:45] (03PS1) 10TrainBranchBot: all wikis to 1.40.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879622 (https://phabricator.wikimedia.org/T325581) [19:02:47] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.40.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879622 (https://phabricator.wikimedia.org/T325581) (owner: 10TrainBranchBot) [19:03:06] 10SRE, 10Traffic, 10Patch-For-Review: haproxy: work on systemd unit hardening (cp hosts) - https://phabricator.wikimedia.org/T323944 (10ssingh) The hardened haproxy unit has been running for a while on traffic-cache-bullseye.traffic.eqiad1.wikimedia.cloud without any issues. Pending any further comments or i... [19:03:23] (03Merged) 10jenkins-bot: all wikis to 1.40.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879622 (https://phabricator.wikimedia.org/T325581) (owner: 10TrainBranchBot) [19:09:49] (03PS1) 10JHathaway: facter block_devices support containers [puppet] - 10https://gerrit.wikimedia.org/r/879624 [19:10:17] (03CR) 10JHathaway: "kindly review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/879624 (owner: 10JHathaway) [19:11:09] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.18 refs T325581 [19:11:13] PROBLEM - SSH on stat1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:11:13] T325581: 1.40.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T325581 [19:12:30] ACKNOWLEDGEMENT - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn singtel maintenance https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:12:30] ACKNOWLEDGEMENT - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: daniel_zahn singtel maintenance https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:14:39] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-xcollazo-singleuser-conda-analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:15:55] RECOVERY - SSH on stat1004 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:16:15] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:23:58] (03CR) 10BCornwall: [V: 03+1 C: 03+2] varnish: Template out thread pool settings [puppet] - 10https://gerrit.wikimedia.org/r/878201 (https://phabricator.wikimedia.org/T323723) (owner: 10BCornwall) [19:52:42] (03PS1) 10Marostegui: instances.yaml: Add db1176 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/879630 (https://phabricator.wikimedia.org/T326116) [19:53:30] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1176 to 
dbctl [puppet] - 10https://gerrit.wikimedia.org/r/879630 (https://phabricator.wikimedia.org/T326116) (owner: 10Marostegui) [19:55:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1176 (mariadb 11) to dbctl, depooled T326116', diff saved to https://phabricator.wikimedia.org/P43146 and previous config saved to /var/cache/conftool/dbconfig/20230112-195514-marostegui.json [19:55:20] T326116: Package and test MariaDB 11 - https://phabricator.wikimedia.org/T326116 [19:56:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1176 to LB with just 1% weight T326116', diff saved to https://phabricator.wikimedia.org/P43147 and previous config saved to /var/cache/conftool/dbconfig/20230112-195651-marostegui.json [19:59:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1176 T326116', diff saved to https://phabricator.wikimedia.org/P43148 and previous config saved to /var/cache/conftool/dbconfig/20230112-195922-marostegui.json [20:06:09] 10SRE: Number of mw swift objects in eqiad greater than codfw - https://phabricator.wikimedia.org/T326857 (10andrea.denisse) [20:06:24] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Q3:rack/setup/install ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326848 (10RobH) [20:06:38] 10SRE-swift-storage, 10serviceops: serviceops implementation tracking for ms-fe2013 - ms-fe2014, thanos-fe2004 - https://phabricator.wikimedia.org/T326849 (10RobH) 05Open→03Invalid actually data persistence this was a mis categorization [20:07:05] 10SRE: Number of mw swift objects in eqiad greater than codfw - https://phabricator.wikimedia.org/T326857 (10andrea.denisse) [20:07:22] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10serviceops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10RobH) [20:07:36] 10SRE-swift-storage, 10serviceops: serviceops implementation tracking for ms-fe1013 - ms-fe1014, thanos-fe1004 - 
https://phabricator.wikimedia.org/T326847 (10RobH) 05Open→03Invalid in valid this is actually data persistence i had it mislabeled [20:07:40] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10RobH) [20:08:08] !log Setting thread_pool_max for varnish-frontend to 12000 [20:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:50] ACKNOWLEDGEMENT - Number of mw swift objects in eqiad greater than codfw on alert1001 is CRITICAL: account=mw-media class=thumb Andrea Denisse https://phabricator.wikimedia.org/T326857 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?var-DC=eqiad [20:11:53] 10SRE, 10Traffic: Remove IPSec/Strongswan from Puppet repository - https://phabricator.wikimedia.org/T326745 (10BCornwall) 05Open→03Resolved It's merged, so I guess this can be closed. :) [20:17:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:19:24] 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q3): implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10lmata) [20:21:34] 10SRE-OnFire, 10SRE Observability (FY2022/2023-Q3): implementing an incident response workflow automation tool for SRE - https://phabricator.wikimedia.org/T308467 (10lmata) [20:33:58] (03PS3) 10Ryan Kemper: [WIP] wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064) [20:36:05] (03CR) 10CI reject: [V: 04-1] [WIP] wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/879599 
(https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [20:36:29] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:37:31] (03PS12) 10Ryan Kemper: wdqs-data-reload: use NFS for data reloads [cookbooks] - 10https://gerrit.wikimedia.org/r/876217 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [20:37:45] (03PS13) 10Ryan Kemper: wdqs: use NFS for data reloads [cookbooks] - 10https://gerrit.wikimedia.org/r/876217 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [20:38:01] PROBLEM - Query Service HTTP Port on wdqs1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [20:39:19] (03Abandoned) 10Ryan Kemper: elastic: decom elastic2035 [puppet] - 10https://gerrit.wikimedia.org/r/759637 (https://phabricator.wikimedia.org/T316729) (owner: 10Ryan Kemper) [20:39:31] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.070 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:39:37] RECOVERY - Query Service HTTP Port on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [20:49:07] (03PS4) 10Ryan Kemper: team-search-platform: relax kafka burrow check [alerts] - 10https://gerrit.wikimedia.org/r/868234 (owner: 10DCausse) [20:49:47] (03CR) 10Ryan Kemper: [C: 03+1] wdqs: use NFS for data reloads [cookbooks] - 10https://gerrit.wikimedia.org/r/876217 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [20:49:51] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: use NFS for data reloads [cookbooks] - 10https://gerrit.wikimedia.org/r/876217 (https://phabricator.wikimedia.org/T323096) (owner: 10Bking) [20:55:03] (03PS1) 10Herron: grafana: stop testing home.json dashboard [puppet] - 
10https://gerrit.wikimedia.org/r/879644 [20:56:06] (03CR) 10Herron: [C: 03+2] grafana: stop testing home.json dashboard [puppet] - 10https://gerrit.wikimedia.org/r/879644 (owner: 10Herron) [20:56:37] (03PS4) 10Herron: [WIP] wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [21:00:04] brennen and TheresNoTime: That opportune time is upon us again. Time for a UTC late backport and config training deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230112T2100). [21:00:05] samwilson, koi, and ebernhardson: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:01:06] brennen TheresNoTime hello I'm present. [21:01:09] (03PS1) 10Ottomata: Add dummy an-launcher1002.eqiad.wmnet/analytics-platform-eng keytab [labs/private] - 10https://gerrit.wikimedia.org/r/879646 (https://phabricator.wikimedia.org/T326827) [21:01:17] \o [21:01:48] ahoy all! 
[21:01:54] (03CR) 10Herron: [C: 03+2] "followed this up with I94b7c3400a7d493e30b5ab03504d08cbc3aca8a3 since CI was failing Grafana changes with 'FileNotFoundError: [Errno 2] No" [puppet] - 10https://gerrit.wikimedia.org/r/871290 (https://phabricator.wikimedia.org/T307465) (owner: 10Majavah) [21:02:35] (03CR) 10Ryan Kemper: [C: 03+1] team-search-platform: relax kafka burrow check [alerts] - 10https://gerrit.wikimedia.org/r/868234 (owner: 10DCausse) [21:02:37] (03CR) 10Ryan Kemper: [C: 03+2] team-search-platform: relax kafka burrow check [alerts] - 10https://gerrit.wikimedia.org/r/868234 (owner: 10DCausse) [21:02:41] I can be deploy tribute today [21:03:27] thanks [21:03:46] (03Merged) 10jenkins-bot: team-search-platform: relax kafka burrow check [alerts] - 10https://gerrit.wikimedia.org/r/868234 (owner: 10DCausse) [21:04:10] (03PS6) 10Thcipriani: Remove Beta Feature for Realtime Preview and enable on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868816 (https://phabricator.wikimedia.org/T323033) (owner: 10Samwilson) [21:04:32] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868816 (https://phabricator.wikimedia.org/T323033) (owner: 10Samwilson) [21:04:40] (03PS1) 10Ottomata: Add analytics-platform-eng-admins and system user keytab to an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/879648 (https://phabricator.wikimedia.org/T326827) [21:05:03] ebernhardson: any particular order for yours? 
[21:05:19] (03Merged) 10jenkins-bot: Remove Beta Feature for Realtime Preview and enable on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/868816 (https://phabricator.wikimedia.org/T323033) (owner: 10Samwilson) [21:05:35] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:868816|Remove Beta Feature for Realtime Preview and enable on plwiki (T323033)]] [21:05:37] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Add dummy an-launcher1002.eqiad.wmnet/analytics-platform-eng keytab [labs/private] - 10https://gerrit.wikimedia.org/r/879646 (https://phabricator.wikimedia.org/T326827) (owner: 10Ottomata) [21:05:39] thcipriani: shouldn't matter, although the first config patch doesn't do anything until the wmf.18 patch is deployed [21:05:39] T323033: Graduate Realtime Preview feature from Beta to being available for everyone - https://phabricator.wikimedia.org/T323033 [21:06:04] no koi no stang :\ [21:06:13] o/ [21:06:23] (03CR) 10Thcipriani: [C: 03+2] looksLikeAutomation: Allow flagging requests from arbitrary headers [extensions/CirrusSearch] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/879571 (https://phabricator.wikimedia.org/T326757) (owner: 10Ebernhardson) [21:06:31] sorry for the delay, I mute the sound 0 0 [21:07:03] (03CR) 10Ottomata: "This will grant some sudo perms on an-launcher1002:" [puppet] - 10https://gerrit.wikimedia.org/r/879648 (https://phabricator.wikimedia.org/T326827) (owner: 10Ottomata) [21:07:04] oh hey cirno [21:07:08] !log thcipriani@deploy1002 thcipriani and samwilson: Backport for [[gerrit:868816|Remove Beta Feature for Realtime Preview and enable on plwiki (T323033)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [21:07:24] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39120/console" [puppet] - 10https://gerrit.wikimedia.org/r/879648 
(https://phabricator.wikimedia.org/T326827) (owner: 10Ottomata) [21:07:31] testing now [21:07:35] samwilson: your change should be on mwdebug, ch...cool :) [21:09:32] thcipriani: hehe :) I like the new message, and it's now on all debug servers? cool. And yep, tested and all looks grand. Am happy for it to proceed. [21:10:07] samwilson: great, going live everywhere now, thanks for checking :) [21:10:16] and, yeah, all debug servers [21:16:19] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:868816|Remove Beta Feature for Realtime Preview and enable on plwiki (T323033)]] (duration: 10m 43s) [21:16:23] T323033: Graduate Realtime Preview feature from Beta to being available for everyone - https://phabricator.wikimedia.org/T323033 [21:16:31] samwilson: ^ should be live now [21:16:42] thanks! checking. [21:16:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879561 (https://phabricator.wikimedia.org/T313698) (owner: 10Stang) [21:17:08] ^ cirno getting your first one staged now [21:17:38] (03Merged) 10jenkins-bot: etwikiquote: Switch logo variant back [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879561 (https://phabricator.wikimedia.org/T313698) (owner: 10Stang) [21:17:53] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:879561|etwikiquote: Switch logo variant back (T313698)]] [21:17:57] T313698: Requesting temporary logo change for et.wikiquote.org - https://phabricator.wikimedia.org/T313698 [21:19:24] thcipriani: Tested everywhere and all is well. Thanks! 
[21:19:28] !log thcipriani@deploy1002 thcipriani and stang: Backport for [[gerrit:879561|etwikiquote: Switch logo variant back (T313698)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [21:19:41] samwilson: nice, thanks for checking :) [21:19:53] cirno: your first change is live on mwdebug, check please [21:19:56] looking [21:20:09] it works [21:21:18] cool going live [21:21:37] (03Merged) 10jenkins-bot: looksLikeAutomation: Allow flagging requests from arbitrary headers [extensions/CirrusSearch] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/879571 (https://phabricator.wikimedia.org/T326757) (owner: 10Ebernhardson) [21:21:56] ^ ebernhardson we'll get your wmf.18 one next since it just merged [21:22:22] kk [21:23:27] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [21:24:57] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [21:27:19] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:879561|etwikiquote: Switch logo variant back (T313698)]] (duration: 09m 25s) [21:27:22] T313698: Requesting temporary logo change for et.wikiquote.org - https://phabricator.wikimedia.org/T313698 [21:27:54] ^ cirno first one should be live, I'm going to do a quick mediawiki backport, then hop back to your config patches [21:28:07] got it [21:28:37] 10Puppet, 10SRE, 10Data-Services, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): clouddumps1002: ferm is being started on every puppet run - https://phabricator.wikimedia.org/T323324 (10Dzahn) >>! In T323324#8518330, @Andrew wrote: >>>! In T323324#8517754, @Dzahn wrote: >> @Andrew Is it no... 
[21:28:37] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:879571|looksLikeAutomation: Allow flagging requests from arbitrary headers (T326757)]] [21:28:41] T326757: Investigate doubling of full_text search query rate since jan 1, 2023 - https://phabricator.wikimedia.org/T326757 [21:30:21] !log thcipriani@deploy1002 thcipriani and ebernhardson: Backport for [[gerrit:879571|looksLikeAutomation: Allow flagging requests from arbitrary headers (T326757)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:30:43] ^ ebernhardson the wmf.18 change should be live on mwdebug, check please [21:31:12] thcipriani: basic tests look to work (it does nothing until configured) [21:31:21] meaning nothing appears broken :) [21:31:36] :) [21:31:40] sounds good, going live [21:33:17] cirno: I'm going to skip 876196 for the time being since it requires table creation. I'd like someone who is more familiar with how we do that nowadays to take a look at that :) But I'll get 879600 done after this. [21:33:46] ok, I'll reschedule this patch [21:34:55] <3 [21:36:28] 10SRE, 10SRE-OnFire, 10Release-Engineering-Team, 10serviceops-collab, 10Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (10Dzahn) >>! In T309162#8519453, @hashar wrote: > That is what this task is about: remove repos from the deployment servers w... 
[21:37:47] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:879571|looksLikeAutomation: Allow flagging requests from arbitrary headers (T326757)]] (duration: 09m 10s) [21:37:51] T326757: Investigate doubling of full_text search query rate since jan 1, 2023 - https://phabricator.wikimedia.org/T326757 [21:38:09] ^ ebernhardson we'll get the rest of yours done all together [21:38:27] (03PS2) 10Thcipriani: nlwiki: Add block right to checkuser group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879600 (https://phabricator.wikimedia.org/T326355) (owner: 10Stang) [21:38:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879600 (https://phabricator.wikimedia.org/T326355) (owner: 10Stang) [21:38:57] (after this one :)) [21:39:30] (03Merged) 10jenkins-bot: nlwiki: Add block right to checkuser group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879600 (https://phabricator.wikimedia.org/T326355) (owner: 10Stang) [21:39:42] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:879600|nlwiki: Add block right to checkuser group (T326355)]] [21:39:46] T326355: Assign block rights to the checkuser group on nl.wikipedia.org - https://phabricator.wikimedia.org/T326355 [21:39:46] kk [21:41:19] !log thcipriani@deploy1002 thcipriani and stang: Backport for [[gerrit:879600|nlwiki: Add block right to checkuser group (T326355)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [21:41:40] ^ cirno last one, check please (if you're able) [21:42:25] thcipriani, I checked https://nl.wikipedia.org/wiki/Special:Listgrouprights and LGTM [21:42:44] cool, thanks, going live :) [21:45:06] (03PS5) 10Ryan Kemper: wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064) [21:45:15] (03CR) 10Ryan Kemper: wdqs: 
add recording rule for req success ratio (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064) (owner: 10Ryan Kemper) [21:45:37] (03PS6) 10Ryan Kemper: wdqs: add recording rule for req success ratio [puppet] - 10https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064) [21:47:09] (03PS1) 10Dreamy Jazz: Start writing to cul_comment_id and cul_comment_plaintext_id on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879652 (https://phabricator.wikimedia.org/T233004) [21:48:47] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:879600|nlwiki: Add block right to checkuser group (T326355)]] (duration: 09m 04s) [21:48:51] T326355: Assign block rights to the checkuser group on nl.wikipedia.org - https://phabricator.wikimedia.org/T326355 [21:49:05] ^ cirno should be live now [21:49:22] (03PS2) 10Thcipriani: cirrus: Divert requests with x-public-cloud set to a dedicated pool counter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879161 (https://phabricator.wikimedia.org/T326757) (owner: 10Ebernhardson) [21:49:25] thanks! [21:49:49] (03PS2) 10Thcipriani: cirrus: Disable incoming link counting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862343 (https://phabricator.wikimedia.org/T317023) (owner: 10Ebernhardson) [21:50:37] 10SRE, 10SRE-OnFire, 10Release-Engineering-Team, 10serviceops-collab, 10Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (10Krinkle) >>! In T309162#7958771, @Dzahn wrote: > Top ten oldest repos by modifiation time, oldest first: > > ` > May 30 201... 
[21:51:06] ebernhardson: I think these two patches are going to merge conflict :) [21:51:24] (03CR) 10Thcipriani: [C: 03+2] cirrus: Divert requests with x-public-cloud set to a dedicated pool counter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879161 (https://phabricator.wikimedia.org/T326757) (owner: 10Ebernhardson) [21:52:06] (03Merged) 10jenkins-bot: cirrus: Divert requests with x-public-cloud set to a dedicated pool counter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879161 (https://phabricator.wikimedia.org/T326757) (owner: 10Ebernhardson) [21:52:10] (03PS2) 10Dreamy Jazz: Start writing to cul_reason_id and cul_reason_plaintext_id on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879652 (https://phabricator.wikimedia.org/T233004) [21:52:16] (03PS3) 10Dreamy Jazz: Start writing to cul_reason_id and cul_reason_plaintext_id on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879652 (https://phabricator.wikimedia.org/T233004) [21:52:52] ebernhardson: yeep, merge conflict, could you update https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/862343 for me? 
[21:53:00] sure, sec [21:54:05] ebernhardson: thcipriani: fwiw https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/879161/2/wmf-config/InitialiseSettings.php is slightly non-alphabetical order, might want to fix that while fixing the conflict [21:54:08] (03PS3) 10Ebernhardson: cirrus: Disable incoming link counting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862343 (https://phabricator.wikimedia.org/T317023) [21:54:32] (03PS4) 10Dreamy Jazz: Start writing to cul_reason_id and cul_reason_plaintext_id on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879652 (https://phabricator.wikimedia.org/T233004) [21:54:59] well, I guess I'm assuming it should be alphabetical, I'm not actually sure :D [21:55:20] ryankemper: sadly, it only happens to look alphabetical in that little snippet, there isn't a particular ordering except new things at the end of the other cirrus things [21:55:23] :D [21:55:33] checks out :P [21:55:36] 10SRE, 10ops-eqiad, 10DC-Ops: Q2:rack/setup/install puppetdb1003 - https://phabricator.wikimedia.org/T317892 (10Papaul) 05Open→03Resolved [21:56:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by thcipriani@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862343 (https://phabricator.wikimedia.org/T317023) (owner: 10Ebernhardson) [21:56:20] thcipriani: the incoming links one isn't testable, it only executes on the job runners [21:56:45] (03Merged) 10jenkins-bot: cirrus: Disable incoming link counting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/862343 (https://phabricator.wikimedia.org/T317023) (owner: 10Ebernhardson) [21:56:49] !log run populateCucComment.php on testwiki # T233004 [21:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:52] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:56:58] !log thcipriani@deploy1002 Started scap: Backport for [[gerrit:879161|cirrus: Divert 
requests with x-public-cloud set to a dedicated pool counter (T326757)]], [[gerrit:862343|cirrus: Disable incoming link counting (T317023)]]
[21:57:02] T317023: Investigate moving incoming_links computation to a batch job - https://phabricator.wikimedia.org/T317023
[21:57:03] T326757: Investigate doubling of full_text search query rate since jan 1, 2023 - https://phabricator.wikimedia.org/T326757
[21:58:05] !log krinkle@deploy1002 Installing scap version "4.32.0" for 1 hosts
[21:58:15] !log krinkle@deploy1002 Installation of scap version "4.32.0" completed for 1 hosts
[21:58:33] !log thcipriani@deploy1002 thcipriani and ebernhardson: Backport for [[gerrit:879161|cirrus: Divert requests with x-public-cloud set to a dedicated pool counter (T326757)]], [[gerrit:862343|cirrus: Disable incoming link counting (T317023)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[21:58:46] !log krinkle@deploy1002 Installing scap version "4.32.0" for 1 hosts
[21:58:56] !log krinkle@deploy1002 Installation of scap version "4.32.0" completed for 1 hosts
[21:59:16] ^ ebernhardson both of the configs are live on mwdebug
[21:59:29] check please :)
[21:59:34] !log krinkle@deploy1002$ `scap install-world -v --limit-hosts` for webperf1003.eqiad and webperf2003.codfw, ref T326668
[21:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:37] T326668: Scap fails on debian bullseye targets - https://phabricator.wikimedia.org/T326668
[21:59:42] !log krinkle@deploy1002 Started deploy [performance/navtiming@172cc22]: (no justification provided)
[21:59:51] !log krinkle@deploy1002 Finished deploy [performance/navtiming@172cc22]: (no justification provided) (duration: 00m 08s)
[22:00:09] thcipriani: everything looks reasonable
[22:00:17] cool, going live
[22:01:40] (PS1) Dreamy Jazz: Start writing to cul_reason_[plaintext]_id on group0 and group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/879653 (https://phabricator.wikimedia.org/T233004)
[22:02:00] (PS5) Dreamy Jazz: Start writing to cul_reason_id and cul_reason_plaintext_id on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/879652 (https://phabricator.wikimedia.org/T233004)
[22:02:07] (PS2) Dreamy Jazz: Start writing to cul_reason_[plaintext]_id on group0 and group1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/879653 (https://phabricator.wikimedia.org/T233004)
[22:06:22] !log thcipriani@deploy1002 Finished scap: Backport for [[gerrit:879161|cirrus: Divert requests with x-public-cloud set to a dedicated pool counter (T326757)]], [[gerrit:862343|cirrus: Disable incoming link counting (T317023)]] (duration: 09m 23s)
[22:06:27] T317023: Investigate moving incoming_links computation to a batch job - https://phabricator.wikimedia.org/T317023
[22:06:27] T326757: Investigate doubling of full_text search query rate since jan 1, 2023 - https://phabricator.wikimedia.org/T326757
[22:06:34] ^ ebernhardson alright, all should be live now
[22:07:04] thcipriani: thanks! already seeing them working in dashboards
[22:07:09] nice :)
[22:07:13] !log end UTC late backport
[22:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:54] !log start of "foreachwikiindblist s3.dblist extensions/CheckUser/maintenance/populateCucComment.php" in a screen in mwmaint1002 # T233004
[22:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:57] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[22:15:15] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert
[22:31:45] PROBLEM - Check systemd state on people2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:38:51] ^ yea, that server should not have the auto restart service..
[22:39:00] it doesnt have rsync on it
[22:41:04] (CR) Herron: "LGTM overall, please see minor syntax issue inline" [puppet] - https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper)
[22:45:06] (CR) Herron: "annnnd one more thing 😇" [puppet] - https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper)
[22:45:19] !log people2002 - apt-get remove --purge rsync
[22:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:46:26] no, it's different, rsync package is installed..but someone or something deleted the config
[22:46:36] letting puppet recreate
[22:54:23] SRE: rsync server on people2002 - https://phabricator.wikimedia.org/T326888 (Dzahn)
[22:55:01] Hey all - was going to deploy some changes to PrivateSettings.php - let me know if I shouldn’t for any reason.
[22:55:15] (CR) Dzahn: [C: +2] "this worked after merge but today it's like the config for rsyncd has been wiped - https://phabricator.wikimedia.org/T326888" [puppet] - https://gerrit.wikimedia.org/r/875806 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[22:56:19] ACKNOWLEDGEMENT - Check systemd state on people2002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_rsync.service daniel_zahn https://phabricator.wikimedia.org/T326888 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:57:17] SRE: rsync server on people2002 - https://phabricator.wikimedia.org/T326888 (Dzahn) restart service was only added recently, but I had tested it and it did not have that problem a couple days ago. https://gerrit.wikimedia.org/r/875806
[22:58:37] SRE, SRE-OnFire, Release-Engineering-Team, serviceops-collab, Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (Dzahn) I don't remember exactly but most likely find -mtime, yea. ACK!
[23:06:59] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[23:08:29] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[23:08:34] !log Deployed (temporary) security mitigations for T326691
[23:08:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:08:46] SRE, ops-codfw, DC-Ops: hw troubleshooting: DIMM_B2 for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T326834 (Dzahn) @Jhancock.wm Thanks for the super quick turnaround. That was fast, wow. someone needs to follow-up, for example do we set the status back to active in netbox, does it have...
[23:10:34] SRE, ops-codfw, DC-Ops: hw troubleshooting: DIMM_B2 for mc2040.codfw.wmnet - https://phabricator.wikimedia.org/T326834 (Dzahn) set back to active in netbox
[23:10:38] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@99a3e6f]: import_cirrus_index: use spark3
[23:13:10] !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@99a3e6f]: import_cirrus_index: use spark3 (duration: 02m 31s)
[23:19:21] (PS1) Jdlrobson: English Wikipedia uses Vector 2022 skin [mediawiki-config] - https://gerrit.wikimedia.org/r/879659
[23:22:47] (PS1) BCornwall: prometheus: Generate varnish params file [puppet] - https://gerrit.wikimedia.org/r/879660 (https://phabricator.wikimedia.org/T323723)
[23:22:59] (PS1) Jdlrobson: [Just in case] Disable thumbnails on English Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/879661
[23:23:09] (CR) CI reject: [V: -1] prometheus: Generate varnish params file [puppet] - https://gerrit.wikimedia.org/r/879660 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall)
[23:24:23] (PS2) BCornwall: prometheus: Generate varnish params file [puppet] - https://gerrit.wikimedia.org/r/879660 (https://phabricator.wikimedia.org/T323723)
[23:24:43] (CR) CI reject: [V: -1] prometheus: Generate varnish params file [puppet] - https://gerrit.wikimedia.org/r/879660 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall)
[23:25:47] (PS3) BCornwall: prometheus: Generate varnish params file [puppet] - https://gerrit.wikimedia.org/r/879660 (https://phabricator.wikimedia.org/T323723)
[23:26:08] (CR) CI reject: [V: -1] prometheus: Generate varnish params file [puppet] - https://gerrit.wikimedia.org/r/879660 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall)
[23:26:39] (CR) Jdlrobson: [C: -1] "FYI" [mediawiki-config] - https://gerrit.wikimedia.org/r/879661 (owner: Jdlrobson)
[23:29:55] (PS1) Jdlrobson: Remove redundant block for search descriptions [mediawiki-config] - https://gerrit.wikimedia.org/r/879664 (https://phabricator.wikimedia.org/T324859)
[23:30:34] (PS4) BCornwall: prometheus: Generate varnish params file [puppet] - https://gerrit.wikimedia.org/r/879660 (https://phabricator.wikimedia.org/T323723)
[23:31:18] (PS7) Ryan Kemper: wdqs: add recording rule for req success ratio [puppet] - https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064)
[23:32:40] (CR) CI reject: [V: -1] prometheus: Generate varnish params file [puppet] - https://gerrit.wikimedia.org/r/879660 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall)
[23:32:52] (CR) Ryan Kemper: "done! thanks for catching those" [puppet] - https://gerrit.wikimedia.org/r/879599 (https://phabricator.wikimedia.org/T323064) (owner: Ryan Kemper)
[23:38:25] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to datacenter-ops for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (Papaul)
[23:40:29] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to datacenter-ops for Jennifer Hancock - https://phabricator.wikimedia.org/T326649 (Papaul) Open→Resolved I tested this today with @Jhancock.wm all is working. We can close the task. Thanks @Jelto @Dzahn
[23:41:44] (PS5) BCornwall: prometheus: Generate varnish params file [puppet] - https://gerrit.wikimedia.org/r/879660 (https://phabricator.wikimedia.org/T323723)
[23:43:46] (CR) CI reject: [V: -1] prometheus: Generate varnish params file [puppet] - https://gerrit.wikimedia.org/r/879660 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall)
[23:44:17] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host an-coord1003.mgmt.eqiad.wmnet with reboot policy FORCED
[23:46:52] (PS6) BCornwall: prometheus: Generate varnish params file [puppet] - https://gerrit.wikimedia.org/r/879660 (https://phabricator.wikimedia.org/T323723)
[23:47:05] Don't mind me....
[23:50:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host an-coord1004.mgmt.eqiad.wmnet with reboot policy FORCED
[23:53:06] !log start running cuc_comment_id population script on rest of sections in screens with --sleep 2 # T233004
[23:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:10] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004
[23:53:51] (CR) BCornwall: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39121/console" [puppet] - https://gerrit.wikimedia.org/r/879660 (https://phabricator.wikimedia.org/T323723) (owner: BCornwall)