[00:01:05] RECOVERY - Check systemd state on phab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:11] RECOVERY - puppet last run on wcqs2002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:02:39] RECOVERY - puppet last run on wcqs2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:04:15] (CR) Tim Starling: [C: +2] Discovery: codfw should be pooled for api-ro and appservers-ro [puppet] - https://gerrit.wikimedia.org/r/818652 (https://phabricator.wikimedia.org/T279664) (owner: Tim Starling) [00:06:19] RECOVERY - puppet last run on wcqs1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:06:33] (PS5) Tim Starling: Explicitly declare replaceable settings [mediawiki-config] - https://gerrit.wikimedia.org/r/820247 [00:07:20] (CR) Tim Starling: [C: +2] Explicitly declare replaceable settings [mediawiki-config] - https://gerrit.wikimedia.org/r/820247 (owner: Tim Starling) [00:08:46] (Merged) jenkins-bot: Explicitly declare replaceable settings [mediawiki-config] - https://gerrit.wikimedia.org/r/820247 (owner: Tim Starling) [00:13:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [00:13:51] !log tstarling@deploy1002 Synchronized tests: config tests, for consistency g 820247 (duration: 03m 22s) [00:15:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [00:15:38] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [00:16:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [00:18:38] !log
tstarling@deploy1002 Synchronized wmf-config/InitialiseSettings.php: replaceableSettings g 820247 (duration: 03m 18s) [00:21:59] RECOVERY - puppet last run on wcqs1003 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:22:19] RECOVERY - puppet last run on wcqs1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:25:31] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:26:07] (CR) Dzahn: [C: +1] "looking at the sizes at gitlab1003..this makes sense to me. we don't even need close to that on / but we do need it for backups" [puppet] - https://gerrit.wikimedia.org/r/823115 (https://phabricator.wikimedia.org/T274463) (owner: Jelto) [00:26:25] RECOVERY - puppet last run on wcqs2003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [00:30:15] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:33:09] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:33:51] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder) [00:55:09] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [01:20:09] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK -
OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:30:57] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [01:34:23] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [01:37:23] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:45] (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:39:52] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddumps1001.wikimedia.org with OS bullseye [01:40:43] (PS6) Xcollazo: airflow - Configure new platform_eng instance and rename old one as legacy.
[puppet] - https://gerrit.wikimedia.org/r/817774 (https://phabricator.wikimedia.org/T312858) [01:42:45] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:46:39] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:52:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:58:05] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220816T0200) [02:00:19] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:04:53] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:06:57] PROBLEM - Check systemd state on ms-be1029 is CRITICAL: CRITICAL - degraded: The following units failed: swift-drive-audit.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:45] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:09:43] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:12:03] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:16:07] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [02:16:41] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:16:41] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:17:45] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:07] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:28:29] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:49] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:30:49] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:35:33] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.78 ms [02:37:49] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:47:15] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:50:13] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:51:03] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [02:57:27] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.81 ms [02:58:57] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:17] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:01:33] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:03:13] RECOVERY - Check systemd state on ms-be1029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:11:15] (PS5) Abijeet Patro: Enable message bundle on MetaWiki for WikiLearn [mediawiki-config] - https://gerrit.wikimedia.org/r/820869
(https://phabricator.wikimedia.org/T311587) [03:13:03] PROBLEM - MariaDB Replica SQL: x2 #page on db2144 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table mainstash.objectstash: Cant find record in objectstash, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1151-bin.000810, end_log_pos 5460559 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:14:49] PROBLEM - MariaDB Replica SQL: x2 #page on db1151 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Delete_rows_v1 event on table mainstash.objectstash: Cant find record in objectstash, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db2144-bin.000813, end_log_pos 1038458517 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:21:54] o/ looking into depooling replicas ^^ [03:27:09] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [03:30:23] hmm, it seems those are x2 masters [03:31:13] PROBLEM - MariaDB Replica Lag: x2 #page on db1151 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1380.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:31:53] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:33:01] cwhite: hi, the "Delete_rows_v1 event" seems unusual and both are masters [03:33:14] mutante: indeed, I think 
we need a dba [03:33:16] I guess we are supposed to escalate to DBA per docs [03:35:35] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:41:21] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:41:35] we are trying to call DBAs..in progress [03:41:55] mutante: I just woke up with the page [03:42:21] Amir1: x2 masters stopped replicating [03:42:32] https://phabricator.wikimedia.org/T315271 [03:43:01] looking [03:43:24] talking on phone [03:46:03] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:46:53] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:50:57] (PS1) Cwhite: Revert "Switch www.mediawiki.org to multi-DC mode" [puppet] - https://gerrit.wikimedia.org/r/823231 [03:52:44] (CR) Dzahn: [C: +1] "as advised by DBA after pages due to problems with the x2 db cluster" [puppet] - https://gerrit.wikimedia.org/r/823113 (owner: Tim Starling) [03:52:46] (CR) Cwhite: [C: +2] Revert "Switch www.mediawiki.org to multi-DC mode" [puppet] - https://gerrit.wikimedia.org/r/823231 (owner: Cwhite) [03:53:44] (CR) Dzahn: [C: +1] "advised by DBA after problems in x2 DB cluster" [puppet] - https://gerrit.wikimedia.org/r/823231 (owner: Cwhite) [03:53:57] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1
nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage - ryankemper@cumin1001 - T289135 [03:54:01] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [03:55:16] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage - ryankemper@cumin1001 - T289135 [03:56:04] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reimage - ryankemper@cumin1001 - T289135 [03:57:18] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1059.eqiad.wmnet with OS bullseye [03:57:25] SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1059.eqiad.wmnet with OS bullseye [03:57:43] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:00:01] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:07:39] SRE, MediaWiki-General, Performance-Team, serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (Dzahn) [04:09:07] RECOVERY - MariaDB Replica SQL: x2 #page on db2144 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:11:29] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL -
degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:11:55] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1059.eqiad.wmnet with reason: host reimage [04:12:39] PROBLEM - MariaDB Replica IO: x2 #page on db2143 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: could not find next log: the first event db2144-bin.000670 at 517376598, the last event read from db2144-bin.000813 at 1044345737, the last byte read from db2144-bin.000813 at 1044345768. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooti [04:12:39] ooling_a_replica [04:13:13] PROBLEM - MariaDB Replica IO: x2 #page on db2142 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: could not find next log: the first event db2144-bin.000614 at 40971090, the last event read from db2144-bin.000813 at 1044345737, the last byte read from db2144-bin.000813 at 1044345768. https://wikitech.wikimedia.org/wiki/MariaDB/troubleshootin [04:13:13] oling_a_replica [04:14:23] PROBLEM - MariaDB Replica IO: x2 #page on db1153 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: could not find next log: the first event db1151-bin.000060 at 832064175, the last event read from db1151-bin.000810 at 415298659, the last byte read from db1151-bin.000810 at 415298690. 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [04:14:23] ling_a_replica [04:14:37] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1059.eqiad.wmnet with reason: host reimage [04:15:12] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:17:20] PROBLEM - MariaDB Replica IO: x2 #page on db1152 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: Could not find first log file name in binary log index file https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:19:20] RECOVERY - MariaDB Replica IO: x2 #page on db1152 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:24:40] PROBLEM - MariaDB Replica SQL: x2 #page on db1152 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1032, Errmsg: Could not execute Update_rows_v1 event on table mainstash.objectstash: Cant find record in objectstash, Error_code: 1032: handler error HA_ERR_KEY_NOT_FOUND: the events master log db1151-bin.000001, end_log_pos 2867 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:27:00] RECOVERY - MariaDB Replica Lag: x2 #page on db1151 is OK: OK slave_sql_lag Replication lag: 0.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:27:07] PROBLEM - MariaDB Replica Lag: x2 #page on db1152 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1181.71 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:28:35] PROBLEM - MariaDB Replica Lag: x2 #page on db1153 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1270.23 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:29:22] RECOVERY - MariaDB Replica SQL: x2 #page on db1151 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:31:54] (PS1) Ladsgroup: Switch mainstash back to redis [mediawiki-config] - https://gerrit.wikimedia.org/r/823275 [04:35:21] PROBLEM - MariaDB Replica IO: x2 #page on db2144 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 1236, Errmsg: Got fatal error 1236 from master when reading data from binary log: Error: connecting slave requested to start from GTID 180363291-180363291-1635405, which is not in the masters binlog https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:35:54] (CR) Ladsgroup: [C: +2] Switch mainstash back to redis [mediawiki-config] - https://gerrit.wikimedia.org/r/823275 (owner: Ladsgroup) [04:37:20] (Merged) jenkins-bot: Switch mainstash back to redis [mediawiki-config] - https://gerrit.wikimedia.org/r/823275 (owner: Ladsgroup) [04:38:15] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1059.eqiad.wmnet with OS bullseye [04:38:22] SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1059.eqiad.wmnet with OS bullseye completed: - elastic1059 (...
[04:41:17] RECOVERY - MariaDB Replica IO: x2 #page on db2144 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:43:11] (PS1) Ladsgroup: Revert "Switch mainstash back to redis" [mediawiki-config] - https://gerrit.wikimedia.org/r/823232 [04:43:26] (CR) Ladsgroup: [C: +2] Revert "Switch mainstash back to redis" [mediawiki-config] - https://gerrit.wikimedia.org/r/823232 (owner: Ladsgroup) [04:43:56] RECOVERY - MariaDB Replica SQL: x2 #page on db1152 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:44:13] (Merged) jenkins-bot: Revert "Switch mainstash back to redis" [mediawiki-config] - https://gerrit.wikimedia.org/r/823232 (owner: Ladsgroup) [04:44:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [04:45:34] PROBLEM - Check systemd state on elastic1059 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:45:40] RECOVERY - MariaDB Replica Lag: x2 #page on db1152 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:45:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [04:45:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [04:46:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [04:49:14] RECOVERY - MariaDB Replica IO: x2 #page on db1153 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:50:22] RECOVERY - MariaDB Replica Lag: x2 #page on db1153 is OK: OK slave_sql_lag Replication lag: 0.22 seconds
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:51:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [04:52:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [04:52:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [04:53:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [04:56:14] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:57:57] (CR) Abijeet Patro: Enable message bundle on MetaWiki for WikiLearn (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587) (owner: Abijeet Patro) [04:58:03] (PS6) Abijeet Patro: Enable message bundle on MetaWiki for WikiLearn [mediawiki-config] - https://gerrit.wikimedia.org/r/820869 (https://phabricator.wikimedia.org/T311587) [05:01:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 12:00:00 on db[2142-2143].codfw.wmnet with reason: After-canary [05:01:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 12:00:00 on db[2142-2143].codfw.wmnet with reason: After-canary [05:01:51] (PS1) Tim Starling: Switch off multi-DC for testwiki and test2wiki [puppet] - https://gerrit.wikimedia.org/r/823516 [05:02:26] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:04:26] (CR) Tim Starling: [C: +2] Switch off multi-DC
for testwiki and test2wiki [puppet] - https://gerrit.wikimedia.org/r/823516 (owner: Tim Starling) [05:17:40] (CR) Jdlrobson: [C: +1] "sorry i missed this in your first patch" [mediawiki-config] - https://gerrit.wikimedia.org/r/823268 (https://phabricator.wikimedia.org/T312295) (owner: Clare Ming) [05:21:38] RECOVERY - Check systemd state on elastic1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:23:44] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:34:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s1 T314380 [05:35:04] T314380: Switchover s1 master (db1163 -> db1118) - https://phabricator.wikimedia.org/T314380 [05:35:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s1 T314380 [05:35:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1118 with weight 0 T314380', diff saved to https://phabricator.wikimedia.org/P32393 and previous config saved to /var/cache/conftool/dbconfig/20220816-053534-ladsgroup.json [05:39:16] (PS1) Tim Starling: Remove codfw hosts from X-Wikimedia-Debug [puppet] - https://gerrit.wikimedia.org/r/823518 [05:39:44] PROBLEM - Check systemd state on elastic1051 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:40:59] (PS2) Ladsgroup: mariadb: Promote db1118 to s1 master [puppet] - https://gerrit.wikimedia.org/r/819557 (https://phabricator.wikimedia.org/T314380) (owner: Gerrit maintenance bot) [05:41:01] (CR) Ladsgroup: [C: +2] mariadb: Promote db1118 to s1 master [puppet] - https://gerrit.wikimedia.org/r/819557
(https://phabricator.wikimedia.org/T314380) (owner: Gerrit maintenance bot) [05:43:16] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: name=codfw,dnsdisc=(appservers|api)-ro [05:43:21] !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: name=codfw,dnsdisc=(appservers|api)-ro [05:43:30] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:45:50] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:45:54] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:57:32] (CR) Urbanecm: "removing -2 per discussion on the task" [mediawiki-config] - https://gerrit.wikimedia.org/r/823148 (https://phabricator.wikimedia.org/T315199) (owner: Stang) [05:57:36] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:57:46] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:58:22] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:58:30] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:00:05] kormat, marostegui, and Amir1: #bothumor I ♥ Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220816T0600). [06:01:43] need a couple of minutes [06:02:20] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:04:12] !log Starting s1 eqiad failover from db1163 to db1118 - T314380 [06:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:04:16] T314380: Switchover s1 master (db1163 -> db1118) - https://phabricator.wikimedia.org/T314380 [06:04:32] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:04:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T314380', diff saved to https://phabricator.wikimedia.org/P32394 and previous config saved to /var/cache/conftool/dbconfig/20220816-060455-ladsgroup.json [06:05:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1118 to s1 primary and set section read-write T314380', diff saved to https://phabricator.wikimedia.org/P32395 and previous config saved to /var/cache/conftool/dbconfig/20220816-060530-ladsgroup.json [06:05:47] writes are flowing again [06:09:08] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:09:58] (PS2) Ladsgroup: wmnet: Update s1-master alias [dns] - https://gerrit.wikimedia.org/r/819558 (https://phabricator.wikimedia.org/T314380) (owner: Gerrit maintenance bot) [06:10:17] (CR) Giuseppe Lavagetto: [C: +2] Remove obsolete apache configuration files [puppet] - https://gerrit.wikimedia.org/r/761718 (owner: Zabe) [06:10:30] (CR)
10Giuseppe Lavagetto: [C: 03+2] "Thanks a lot Zabe!" [puppet] - 10https://gerrit.wikimedia.org/r/761718 (owner: 10Zabe) [06:10:52] (03CR) 10Ladsgroup: [C: 03+2] wmnet: Update s1-master alias [dns] - 10https://gerrit.wikimedia.org/r/819558 (https://phabricator.wikimedia.org/T314380) (owner: 10Gerrit maintenance bot) [06:13:50] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 23 Oct 2022 06:50:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:14:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1163 T314380', diff saved to https://phabricator.wikimedia.org/P32396 and previous config saved to /var/cache/conftool/dbconfig/20220816-061413-ladsgroup.json [06:14:20] T314380: Switchover s1 master (db1163 -> db1118) - https://phabricator.wikimedia.org/T314380 [06:14:48] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.454 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:14:54] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 48535 bytes in 0.117 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:23:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maint work on old s1 master (T312984 T312863 T310011 T309311 T60674 T298560 T298555 T310485 T301312) [06:23:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maint work on old s1 master (T312984 T312863 T310011 T309311 T60674 T298560 T298555 T310485 T301312) [06:24:02] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [06:24:02] T60674: Drop page.page_restrictions column from Wikimedia wikis - https://phabricator.wikimedia.org/T60674 [06:24:02] T312984: Adjust the field type of 
flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [06:24:03] T301312: Switchover s1 master (db1118 -> db1163) - https://phabricator.wikimedia.org/T301312 [06:24:03] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [06:24:03] T309311: Make user_editcount unsigned in production - https://phabricator.wikimedia.org/T309311 [06:24:04] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560 [06:24:04] T298555: Fix mismatching field type of logging.log_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298555 [06:25:30] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [06:25:54] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:29:43] (03CR) 10Matthias Mullie: "We've done a couple of manual runs and it all seems to work fine. This can be automated." 
[puppet] - 10https://gerrit.wikimedia.org/r/811312 (https://phabricator.wikimedia.org/T300024) (owner: 10Matthias Mullie) [06:29:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1169', diff saved to https://phabricator.wikimedia.org/P32397 and previous config saved to /var/cache/conftool/dbconfig/20220816-062955-ladsgroup.json [06:31:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maint [06:31:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maint [06:32:59] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:35:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ArielGlenn) Hey @Ottomata it's great to see these hosts moving closer to being in production! One thing I noticed, they are picki...
[06:41:03] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:51:51] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [06:57:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T314041)', diff saved to https://phabricator.wikimedia.org/P32398 and previous config saved to /var/cache/conftool/dbconfig/20220816-065721-ladsgroup.json [06:57:25] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [06:58:03] !log hashar@deploy1002 Started deploy [integration/docroot@c142ba7]: Drop archived wikibase-vuejs-components storybook - T309872 [06:58:07] T309872: Archive wikibase-vuejs-components library repository - https://phabricator.wikimedia.org/T309872 [06:58:12] (03CR) 10Filippo Giunchedi: [C: 03+1] swift: ms-be2028 /dev/sdg1 has failed [puppet] - 10https://gerrit.wikimedia.org/r/823178 (https://phabricator.wikimedia.org/T315213) (owner: 10MVernon) [06:58:14] !log hashar@deploy1002 Finished deploy [integration/docroot@c142ba7]: Drop archived wikibase-vuejs-components storybook - T309872 (duration: 00m 10s) [06:58:28] (03CR) 10Filippo Giunchedi: [C: 03+1] tcpircbot: send tcpircbot logs to centralized logging [puppet] - 10https://gerrit.wikimedia.org/r/822423 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite) [06:59:14] (03CR) 10Filippo Giunchedi: [C: 03+1] tcpircbot: add and enable ecs logging handler [puppet] - 10https://gerrit.wikimedia.org/r/822421 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite) [07:00:05] Amir1 and Urbanecm: #bothumor I � 
Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220816T0700). [07:00:05] No Gerrit patches in the queue for this window AFAICS. [07:12:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P32399 and previous config saved to /var/cache/conftool/dbconfig/20220816-071227-ladsgroup.json [07:13:02] 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (10Joe) [07:13:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1169.eqiad.wmnet with reason: Maint [07:13:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1169.eqiad.wmnet with reason: Maint [07:17:18] (03PS1) 10Ladsgroup: db1169: Disable notification [puppet] - 10https://gerrit.wikimedia.org/r/823579 [07:18:59] (03CR) 10Ladsgroup: [C: 03+2] db1169: Disable notification [puppet] - 10https://gerrit.wikimedia.org/r/823579 (owner: 10Ladsgroup) [07:21:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:21:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [07:21:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance [07:21:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance [07:26:17] !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for ms-be2067.codfw.wmnet [07:26:17] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2067.codfw.wmnet [07:26:33] (03CR) 10MVernon: [C: 03+2] swift: ms-be2028 /dev/sdg1 has 
failed [puppet] - 10https://gerrit.wikimedia.org/r/823178 (https://phabricator.wikimedia.org/T315213) (owner: 10MVernon) [07:27:13] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [07:27:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P32400 and previous config saved to /var/cache/conftool/dbconfig/20220816-072733-ladsgroup.json [07:40:47] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:42:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T314041)', diff saved to https://phabricator.wikimedia.org/P32401 and previous config saved to /var/cache/conftool/dbconfig/20220816-074239-ladsgroup.json [07:42:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [07:42:44] T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041 [07:42:54] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [07:43:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T314041)', diff saved to https://phabricator.wikimedia.org/P32402 and previous config saved to /var/cache/conftool/dbconfig/20220816-074259-ladsgroup.json [07:45:27] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:08:32] (03PS1) 10Jdlrobson: Enable new Vector skin on select pages [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/823587 (https://phabricator.wikimedia.org/T314286) [08:16:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [08:16:16] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1163.eqiad.wmnet with reason: Maintenance [08:16:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance [08:16:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1169.eqiad.wmnet with reason: Maintenance [08:27:43] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:28:08] (03CR) 10Giuseppe Lavagetto: role::alerting_host: run vopsbot (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/821255 (https://phabricator.wikimedia.org/T314840) (owner: 10Giuseppe Lavagetto) [08:32:01] 10SRE, 10SRE-OnFire, 10Observability-Alerting, 10Patch-For-Review: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (10Joe) [08:33:22] (03PS4) 10Giuseppe Lavagetto: role::alerting_host: run vopsbot [puppet] - 10https://gerrit.wikimedia.org/r/821255 (https://phabricator.wikimedia.org/T314840) [08:36:59] (KubernetesRsyslogDown) firing: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:41:59] (KubernetesRsyslogDown) resolved: (2) rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:42:15] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:57:36] (03CR) 10Giuseppe Lavagetto: [C: 03+2] role::alerting_host: run vopsbot [puppet] - 10https://gerrit.wikimedia.org/r/821255 (https://phabricator.wikimedia.org/T314840) (owner: 10Giuseppe Lavagetto) [09:00:25] 10SRE-swift-storage, 10Maps, 10Product-Infrastructure-Team-Backlog, 10User-fgiunchedi: Followups for Tegola and Swift interactions - https://phabricator.wikimedia.org/T307184 (10fgiunchedi) Status update on this: there's a root `screen` on `thanos-fe2001` to delete `tegola-swift-new` and `tegola-swift-cont... [09:02:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1163.eqiad.wmnet with reason: Maintenance [09:02:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1163.eqiad.wmnet with reason: Maintenance [09:02:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on db1169.eqiad.wmnet with reason: Maintenance [09:02:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on db1169.eqiad.wmnet with reason: Maintenance [09:02:36] (03PS1) 10Giuseppe Lavagetto: vopsbot: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/823595 [09:05:26] (03CR) 10Giuseppe Lavagetto: [C: 03+2] vopsbot: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/823595 (owner: 10Giuseppe Lavagetto) [09:05:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1163.eqiad.wmnet with reason: Maintenance [09:05:47] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1163.eqiad.wmnet with reason: Maintenance [09:05:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on db1169.eqiad.wmnet with reason: Maintenance [09:05:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 
db1169.eqiad.wmnet with reason: Maintenance [09:05:51] (03PS10) 10Jaime Nuche: scap: introduce bootstrapping mechanism specific to deployment hosts [puppet] - 10https://gerrit.wikimedia.org/r/820749 [09:14:19] (03PS1) 10Giuseppe Lavagetto: vopsbot: make team, vo_admin optional for users [puppet] - 10https://gerrit.wikimedia.org/r/823596 [09:17:07] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add stub data for profile::vopsbot [labs/private] - 10https://gerrit.wikimedia.org/r/822417 (https://phabricator.wikimedia.org/T314840) (owner: 10Giuseppe Lavagetto) [09:17:15] (03CR) 10Clément Goubert: [C: 03+1] "Thanks for the changes" [puppet] - 10https://gerrit.wikimedia.org/r/821255 (https://phabricator.wikimedia.org/T314840) (owner: 10Giuseppe Lavagetto) [09:19:42] (03PS2) 10Giuseppe Lavagetto: vopsbot: make team, vo_admin optional for users [puppet] - 10https://gerrit.wikimedia.org/r/823596 [09:21:46] (03PS3) 10Giuseppe Lavagetto: vopsbot: make team, vo_admin optional for users [puppet] - 10https://gerrit.wikimedia.org/r/823596 [09:23:49] (03PS15) 10Fomafix: Add additional aliases for sr-cyrl and sr-latn next to sr-ec and sr-el [puppet] - 10https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845) [09:24:27] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Wed 24 Aug 2022 07:48:40 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:26:45] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 23 Oct 2022 06:50:55 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:29:49] (03PS4) 10Giuseppe Lavagetto: vopsbot: make team, vo_admin optional for users [puppet] - 10https://gerrit.wikimedia.org/r/823596 [09:32:08] (03PS5) 10Giuseppe Lavagetto: vopsbot: make team, vo_admin optional for users [puppet] - 10https://gerrit.wikimedia.org/r/823596 [09:33:28] (03PS1) 10Jbond: P:cumin: fix alias for dse_k8s::master [puppet] - 10https://gerrit.wikimedia.org/r/823598 [09:33:43] <_joe_> thanks jbond I was meaning to fix it heh [09:33:45] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36751/console" [puppet] - 10https://gerrit.wikimedia.org/r/823596 (owner: 10Giuseppe Lavagetto) [09:33:50] (03CR) 10Jbond: [C: 03+2] "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/823598 (owner: 10Jbond) [09:35:12] np [09:35:32] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:cumin: fix alias for dse_k8s::master [puppet] - 10https://gerrit.wikimedia.org/r/823598 (owner: 10Jbond) [09:35:40] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] vopsbot: make team, vo_admin optional for users [puppet] - 10https://gerrit.wikimedia.org/r/823596 (owner: 10Giuseppe Lavagetto) [09:43:25] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:44:37] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:55] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:47:31] (03CR) 10Jelto: [C: 03+2] install_server: change partman config for gitlab [puppet] - 10https://gerrit.wikimedia.org/r/823115 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [09:48:02] (03PS1) 
10Giuseppe Lavagetto: vopsbot: fix yaml escaping of # [puppet] - 10https://gerrit.wikimedia.org/r/823600 [09:49:47] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Wed 24 Aug 2022 07:48:40 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:50:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] vopsbot: fix yaml escaping of # [puppet] - 10https://gerrit.wikimedia.org/r/823600 (owner: 10Giuseppe Lavagetto) [09:52:07] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 23 Oct 2022 06:50:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:54:11] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:39] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:24] 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: thanos-be2002 sdj failed - https://phabricator.wikimedia.org/T314913 (10MatthewVernon) In case it helps, `lshw -C disk` tells me `/dev/sdj` is `bus info: scsi@0:2.9.0` and `megacli -ldpdinfo -aall` tells me `Target Id: 9` is associated with the physical di... 
[10:23:58] 10SRE, 10Wikimedia-Mailing-lists: lists.wikimedia.org flapping between soon to expire and renewed cert (Aug 2022) - https://phabricator.wikimedia.org/T315294 (10RhinosF1) [10:26:50] (03PS2) 10Aklapper: xhgui: scrape Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/622447 (https://phabricator.wikimedia.org/T256039) (owner: 10Dave Pifke) [10:28:48] (03PS2) 10Aklapper: maps: introduce imposm-geometry-import [puppet] - 10https://gerrit.wikimedia.org/r/752748 (https://phabricator.wikimedia.org/T218097) (owner: 10MSantos) [10:29:25] (03CR) 10CI reject: [V: 04-1] maps: introduce imposm-geometry-import [puppet] - 10https://gerrit.wikimedia.org/r/752748 (https://phabricator.wikimedia.org/T218097) (owner: 10MSantos) [10:30:27] !log reimaging gitlab2003 (insetup) to test partman recipe from gerrit:823115 - T274463 [10:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:31] T274463: Backups for GitLab - https://phabricator.wikimedia.org/T274463 [10:34:18] !log jelto@cumin1001 START - Cookbook sre.hosts.reimage for host gitlab2003.wikimedia.org with OS bullseye [10:38:24] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=wikireplicas-b,name=dbproxy1019.eqiad.wmnet [10:40:36] !log btullis@puppetmaster1001 conftool action : set/pooled=inactive; selector: cluster=wikireplicas-b,name=dbproxy1018.eqiad.wmnet [10:43:20] (03PS1) 10Jbond: reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) [10:45:45] (03CR) 10Hashar: [C: 03+1] Add missing attrs dependency [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/822734 (owner: 10BryanDavis) [10:49:02] !log jayme@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-eqiad [10:49:46] !log jayme@cumin1001 END (FAIL) - Cookbook sre.k8s.reboot-nodes (exit_code=1) rolling reboot on A:wikikube-staging-worker-eqiad 
[10:50:54] !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [10:51:45] 10SRE, 10Performance-Team, 10Traffic: Enable HTTP compression for arclamp trace logs - https://phabricator.wikimedia.org/T305783 (10Aklapper) a:05dpifke→03None Removing inactive task assignee (please do so as part of offboarding processes). [10:51:54] 10SRE, 10Performance-Team, 10observability: Add monitoring for performance.wikimedia.org - https://phabricator.wikimedia.org/T277927 (10Aklapper) a:05dpifke→03None Removing inactive task assignee (please do so as part of offboarding processes). [10:52:14] 10SRE, 10serviceops-radar, 10Patch-For-Review, 10Performance-Team (Radar), 10Service-deployment-requests: New Service Request: xhgui - https://phabricator.wikimedia.org/T277483 (10Aklapper) a:05dpifke→03None Removing inactive task assignee (please do so as part of offboarding processes). [10:52:24] 10SRE-swift-storage, 10Arc-Lamp, 10Performance-Team, 10Patch-For-Review: Swift container for performance flame graphs (ArcLamp) - https://phabricator.wikimedia.org/T244776 (10Aklapper) a:05dpifke→03None Removing inactive task assignee (please do so as part of offboarding processes). [10:52:48] 10SRE, 10Performance-Team, 10Traffic: Review socket balancing in ATS/Varnish traffic layers - https://phabricator.wikimedia.org/T248522 (10Aklapper) a:05dpifke→03None Removing inactive task assignee (please do so as part of offboarding processes). [10:52:55] 10SRE, 10Analytics-Radar, 10Recommendation-API: Run swift-object-expirer as part of the swift cluster - https://phabricator.wikimedia.org/T229584 (10Aklapper) a:05dpifke→03None Removing inactive task assignee (please do so as part of offboarding processes). 
[10:53:38] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gitlab2003.wikimedia.org with reason: host reimage [10:53:40] 10SRE, 10SRE-swift-storage, 10Performance-Team, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10Aklapper) a:05dpifke→03None Removing inactive task assignee (please do so as part of offboarding processes). [10:59:11] (03PS3) 10Aklapper: Create new http://www.mediawiki.org/xml/sitelist-1.1/ to reference sitelist-1.1.xsd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/697110 (https://phabricator.wikimedia.org/T222516) (owner: 10Luca Mauri) [11:02:19] (03PS2) 10Jbond: reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) [11:02:32] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=wikireplicas-a,name=dbproxy1019.eqiad.wmnet [11:03:11] !log btullis@puppetmaster1001 conftool action : set/pooled=no; selector: cluster=wikireplicas-a,name=dbproxy1018.eqiad.wmnet [11:05:43] (03CR) 10CI reject: [V: 04-1] reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [11:08:58] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host gitlab2003.wikimedia.org with OS bullseye [11:11:13] 10SRE, 10SRE-OnFire, 10Observability-Alerting: Productionize vopsbot - https://phabricator.wikimedia.org/T314840 (10Joe) 05Open→03Resolved [11:14:47] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:16:39] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:22:50] (03PS3) 10Jbond: reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) [11:24:01] !log btullis@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=wikireplicas-a,name=dbproxy1018.eqiad.wmnet [11:24:19] !log btullis@puppetmaster1001 conftool action : set/pooled=inactive; selector: cluster=wikireplicas-a,name=dbproxy1019.eqiad.wmnet [11:24:47] (03PS4) 10Jbond: reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) [11:26:57] (03CR) 10CI reject: [V: 04-1] reqconfig: add ip validation for ipblocks [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [11:27:51] (03CR) 10Jbond: reqconfig: add ip validation for ipblocks (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/823608 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [11:28:23] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-tails-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:30:35] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:31:29] (03PS4) 10Hnowlan: maps: remove tilerator and cassandra [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) [11:37:41] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36753/console" [puppet] - 
10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [11:59:29] PROBLEM - confd service on sretest1001 is CRITICAL: CRITICAL - Expecting active but unit confd is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:02:52] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [12:05:38] 10SRE, 10conftool, 10Patch-For-Review: Add requestctl support to ferm - https://phabricator.wikimedia.org/T313825 (10jbond) > In my brief testing it appears that ferm is happy to accept bogus IPs indeed and this also affects the ferm::rule resource. i.e. a typo in a ferm::rule could cause the entire firewal... [12:06:37] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:08:55] RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:10:45] RECOVERY - confd service on sretest1001 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:13:04] 10Puppet, 10Infrastructure-Foundations: Ferm unloads all iptables rules when it hits a parsing error - https://phabricator.wikimedia.org/T315305 (10jbond) [12:15:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Ottomata) Wrong andrew, I think you meant to ping @Andrew ? 
[12:19:38] (03PS1) 10Jbond: P:base::firewall: use reload instead of restart [puppet] - 10https://gerrit.wikimedia.org/r/823616 (https://phabricator.wikimedia.org/T313825) [12:20:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10ArielGlenn) >>! In T302981#8157062, @Ottomata wrote: > Wrong andrew, I think you meant to ping @Andrew ? Bah, yes I did. Thank... [12:21:07] 10SRE, 10Epic, 10Goal: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10Aklapper) [12:21:30] 10SRE, 10conftool, 10Patch-For-Review: Add requestctl support to ferm - https://phabricator.wikimedia.org/T313825 (10jbond) > In my brief testing it appears that ferm is happy to accept bogus IPs Worth noting that if you use `systemd reload ferm` instead of restart then it at least leaves iptables with the l... [12:21:35] (03CR) 10Jbond: [C: 03+2] P:base::firewall: use reload instead of restart [puppet] - 10https://gerrit.wikimedia.org/r/823616 (https://phabricator.wikimedia.org/T313825) (owner: 10Jbond) [12:22:14] 10SRE, 10observability, 10serviceops, 10Patch-For-Review, and 2 others: Create an alert for high memcached bw usage - https://phabricator.wikimedia.org/T224454 (10Aklapper) @CDanis: Only https://gerrit.wikimedia.org/r/c/operations/puppet/+/691216 is still open on this ticket, should that be merged or aband... 
[12:23:05] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:23:53] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team, 10serviceops: schedule downtime for contint2001 - https://phabricator.wikimedia.org/T294271 (10hashar) [12:23:58] 10SRE, 10ops-codfw, 10Continuous-Integration-Infrastructure, 10DC-Ops, 10netops: DRAC firmware upgrades codfw (was: Flapping codfw management alarm ( contint2001.mgmt/SSH is CRITICAL ))) - https://phabricator.wikimedia.org/T283582 (10hashar) [12:24:40] (03PS6) 10Aklapper: Adjust CSP header for pdfs & videos & set enforce on testwiki [puppet] - 10https://gerrit.wikimedia.org/r/547929 (https://phabricator.wikimedia.org/T117618) (owner: 10Brian Wolff) [12:26:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance [12:26:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1163.eqiad.wmnet with reason: Maintenance [12:26:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [12:26:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [12:27:56] (03PS5) 10Aklapper: Set $wgUploadNavigationUrl for few wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/364121 (https://phabricator.wikimedia.org/T170083) (owner: 10Framawiki) [12:28:56] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (10phaultfinder) [12:31:47] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:39:32] (03CR) 10Vgutierrez: [C: 03+2] Enable query sorting for all testwiki requests [puppet] - 
https://gerrit.wikimedia.org/r/819677 (https://phabricator.wikimedia.org/T314868) (owner: Ori)
[12:48:31] (PS1) Jbond: C:ferm: update ferm to use restart-or-reload instead of restart [puppet] - https://gerrit.wikimedia.org/r/823621 (https://phabricator.wikimedia.org/T315305)
[12:49:35] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36754/console" [puppet] - https://gerrit.wikimedia.org/r/823621 (https://phabricator.wikimedia.org/T315305) (owner: Jbond)
[12:50:24] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (Andrew) Thanks for the suggestion @ArielGlenn. Those hosts are really not working at all right now (something awful is happening...
[12:54:47] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[12:55:35] (CR) Vgutierrez: [C: +1] varnish::tests: add tests for query-sorting [puppet] - https://gerrit.wikimedia.org/r/822715 (https://phabricator.wikimedia.org/T138093) (owner: Ori)
[12:59:01] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Wed 24 Aug 2022 07:48:40 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: Your horoscope predicts another unfortunate UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220816T1300).
[13:00:05] koi: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:05] Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220816T1300)
[13:00:56] o/
[13:01:17] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 23 Oct 2022 06:50:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:04:41] !log jayme@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-eqiad
[13:04:52] hello, is there anyone could deploy in this window?
[13:08:17] hey, I can deploy toay
[13:08:18] today*
[13:08:49] thanks!
[13:10:53] (CR) Majavah: [C: +2] kowiki: Add logo (legacy vector and vector-2022) for 600k articles [mediawiki-config] - https://gerrit.wikimedia.org/r/822717 (https://phabricator.wikimedia.org/T315127) (owner: Stang)
[13:11:01] let's start from the logo
[13:11:08] (CR) Majavah: [C: +2] kowiki: Change logo for 600k articles [mediawiki-config] - https://gerrit.wikimedia.org/r/822718 (https://phabricator.wikimedia.org/T315127) (owner: Stang)
[13:12:20] (Merged) jenkins-bot: kowiki: Add logo (legacy vector and vector-2022) for 600k articles [mediawiki-config] - https://gerrit.wikimedia.org/r/822717 (https://phabricator.wikimedia.org/T315127) (owner: Stang)
[13:13:29] (PS4) Majavah: kowiki: Change logo for 600k articles [mediawiki-config] - https://gerrit.wikimedia.org/r/822718 (https://phabricator.wikimedia.org/T315127) (owner: Stang)
[13:14:04] (CR) Majavah: [C: +2] kowiki: Change logo for 600k articles [mediawiki-config] - https://gerrit.wikimedia.org/r/822718 (https://phabricator.wikimedia.org/T315127) (owner: Stang)
[13:14:44] sorry this takes a while. 
Apparently I'm somewhat rusty on how to properly use gerrit when pulling several patches at once
[13:15:20] (Merged) jenkins-bot: kowiki: Change logo for 600k articles [mediawiki-config] - https://gerrit.wikimedia.org/r/822718 (https://phabricator.wikimedia.org/T315127) (owner: Stang)
[13:15:31] finally
[13:15:38] koi: can you test on mwdebug1001?
[13:15:49] looking
[13:16:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[13:16:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[13:16:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[13:16:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[13:17:10] taavi: tested in vector-2022, timeless and legacy vector, LGTM
[13:17:21] thanks, syncing
[13:17:49] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1057.eqiad.wmnet with OS bullseye
[13:17:57] SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1057.eqiad.wmnet with OS bullseye
[13:20:02] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:20:53] !log taavi@deploy1002 Synchronized static/images: Config: [[gerrit:822717|kowiki: Add logo (legacy vector and vector-2022) for 600k articles (T315127)]] (duration: 03m 29s)
[13:20:56] T315127: Requesting temporary logo change for ko.wikipedia.org - https://phabricator.wikimedia.org/T315127
[13:21:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:21:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:22:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:24:12] !log taavi@deploy1002 Synchronized wmf-config: Config: [[gerrit:822718|kowiki: Change logo for 600k articles (T315127)]] (duration: 03m 11s)
[13:24:13] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route
[13:24:16] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
[13:24:21] !log jayme@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-eqiad
[13:24:45] deployed
[13:24:54] ok, looking at that jawiki patch and the related discussion now
[13:25:01] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[13:27:17] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[13:27:29] taavi: could you please flush the cache?
[13:27:59] koi: which cache?
[13:28:05] the logo one
[13:28:33] you changed the file name, so there shouldn't be any previous CDN caches to flush
[13:29:18] (PS2) Majavah: jawiki: Restrict abusefilter log view to "abusefilter-modify" user [mediawiki-config] - https://gerrit.wikimedia.org/r/823148 (https://phabricator.wikimedia.org/T315199) (owner: Stang)
[13:29:34] opened in another browser and looks fine
[13:30:08] I guess there's some weird things happened in my ISP..
[13:31:37] (CR) Majavah: [C: +2] "deploying as Martin's concerns have been addressed" [mediawiki-config] - https://gerrit.wikimedia.org/r/823148 (https://phabricator.wikimedia.org/T315199) (owner: Stang)
[13:32:09] SRE, MediaWiki-General, Traffic, MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), Patch-For-Review: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 (ori)
[13:32:46] (Merged) jenkins-bot: jawiki: Restrict abusefilter log view to "abusefilter-modify" user [mediawiki-config] - https://gerrit.wikimedia.org/r/823148 (https://phabricator.wikimedia.org/T315199) (owner: Stang)
[13:33:00] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1057.eqiad.wmnet with reason: host reimage
[13:33:22] koi: can you test the abusefilter one on mwdebug1001?
[13:33:45] looking
[13:34:26] koi: sorry I didn't actually pull it properly, try now
[13:36:28] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[13:36:30] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1057.eqiad.wmnet with reason: host reimage
[13:37:06] taavi: this patch works as expected, I could not access https://ja.wikipedia.org/w/index.php?title=Special:Log/abusefilter&uselang=en now
[13:37:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:37:20] deploying, thanks
[13:38:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:38:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:38:22] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route
[13:38:23] !log jayme@cumin1001 END (FAIL) - Cookbook sre.discovery.service-route (exit_code=1)
[13:38:29] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route
[13:38:30] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0)
[13:39:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:39:46] <_joe_> jayme: that cookbook is broken, I have no idea why
[13:39:51] Puppet, Infrastructure-Foundations, Documentation: update puppet documentation - https://phabricator.wikimedia.org/T315317 (Aklapper)
[13:40:05] <_joe_> I found out during the maintenance
[13:40:07] _joe_: yeah, I know. 
That's exactly why I was running it :)
[13:40:14] <_joe_> ah ok :)
[13:40:49] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:823148|jawiki: Restrict abusefilter log view to "abusefilter-modify" user (T315199)]] (duration: 03m 21s)
[13:40:52] T315199: Restrict viewing [[Special:Log/abusefilter]] only Abusefilter editors on ja.wikipedia - https://phabricator.wikimedia.org/T315199
[13:40:59] koi: done too!
[13:41:05] anyone have anything else to deploy?
[13:41:40] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms
[13:41:57] !log UTC afternoon deploys done
[13:41:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:33] (PS1) ArielGlenn: don't rsync to clouddumps1001,2 while they are still being set up [puppet] - https://gerrit.wikimedia.org/r/823649 (https://phabricator.wikimedia.org/T302981)
[13:48:44] taavi: please revert this patch, there's some issue occurred
[13:49:02] SRE, SRE-swift-storage, Performance-Team, Traffic, Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (MatthewVernon) We had a bit of a chat about this today, and thought it worth noting some of the reasons it would be good to actua...
[13:49:17] or I'll come up with another patch, there's some data leakage
[13:49:52] hey - what kind of issue?
[13:50:45] by default only user with suppress permission could see log/suppress, but now everyone could see it
[13:51:11] (PS1) Majavah: Revert "jawiki: Restrict abusefilter log view to "abusefilter-modify" user" [mediawiki-config] - https://gerrit.wikimedia.org/r/823632
[13:51:17] (CR) Majavah: [C: +2] Revert "jawiki: Restrict abusefilter log view to "abusefilter-modify" user" [mediawiki-config] - https://gerrit.wikimedia.org/r/823632 (owner: Majavah)
[13:52:21] ok I'm reverting
[13:52:28] it seems the "+" annotation does not works as expected,
[13:52:48] RECOVERY - ElasticSearch numbers of masters eligible - 9443 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[13:52:50] (Merged) jenkins-bot: Revert "jawiki: Restrict abusefilter log view to "abusefilter-modify" user" [mediawiki-config] - https://gerrit.wikimedia.org/r/823632 (owner: Majavah)
[13:53:57] (CR) Andrew Bogott: [C: +2] "thanks! Sorry for the noise." [puppet] - https://gerrit.wikimedia.org/r/823649 (https://phabricator.wikimedia.org/T302981) (owner: ArielGlenn)
[13:55:29] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: revert: Config: [[gerrit:823148|jawiki: Restrict abusefilter log view to "abusefilter-modify" user (T315199)]] (duration: 03m 12s)
[13:55:33] T315199: Restrict viewing [[Special:Log/abusefilter]] only Abusefilter editors on ja.wikipedia - https://phabricator.wikimedia.org/T315199
[13:55:47] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1057.eqiad.wmnet with OS bullseye
[13:55:53] SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1057.eqiad.wmnet with OS bullseye completed: - elastic1070 (...
[13:55:58] (ThanosCompactIsDown) firing: Thanos component has disappeared. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[13:56:45] (JobUnavailable) firing: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:57:55] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1077.eqiad.wmnet with OS bullseye
[13:58:01] SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1077.eqiad.wmnet with OS bullseye
[13:59:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:59:37] SRE, conftool, Patch-For-Review: Add requestctl support to ferm - https://phabricator.wikimedia.org/T313825 (jhathaway) >> how confident are we that etcd will never have bogus data? > As far as i can tell we dont do any validation, i have created a PS and will speak with joe to see if we can get it a...
[14:00:04] SRE, conftool, Patch-For-Review: Add requestctl support to ferm - https://phabricator.wikimedia.org/T313825 (jhathaway) >>! In T313825#8157069, @jbond wrote: >> In my brief testing it appears that ferm is happy to accept bogus IPs > Worth noting that if you use `systemd reload ferm` instead of restar...
[14:00:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[14:00:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[14:01:06] Puppet, Infrastructure-Foundations, Patch-For-Review: Ferm unloads all iptables rules when it hits a parsing error - https://phabricator.wikimedia.org/T315305 (jbond) p:Triage→Medium
[14:01:15] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:01:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[14:03:06] (CR) Klausman: [V: +2 C: +2] ml-services: Add euwiki, huwiki & hywiki drafttopic isvcs to prod [deployment-charts] - https://gerrit.wikimedia.org/r/823109 (https://phabricator.wikimedia.org/T314456) (owner: Kevin Bazira)
[14:03:57] (CR) Klausman: [C: +2] Add Cumin aliases for ml-cache [puppet] - https://gerrit.wikimedia.org/r/820129 (owner: Muehlenhoff)
[14:05:58] (ThanosCompactIsDown) resolved: Thanos component has disappeared. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/0cb8830a6e957978796729870f560cda/thanos-overview - https://alerts.wikimedia.org/?q=alertname%3DThanosCompactIsDown
[14:06:35] SRE-Access-Requests, LDAP-Access-Requests: Grant private data access to Purity Waigi - https://phabricator.wikimedia.org/T315257 (cmooney)
[14:06:45] (JobUnavailable) resolved: Reduced availability for job thanos-compact in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:07:21] (Merged) jenkins-bot: ml-services: Add euwiki, huwiki & hywiki drafttopic isvcs to prod [deployment-charts] - https://gerrit.wikimedia.org/r/823109 (https://phabricator.wikimedia.org/T314456) (owner: Kevin Bazira)
[14:08:22] (PS1) Clément Goubert: mc2024: do not install redis [puppet] - https://gerrit.wikimedia.org/r/823650
[14:10:42] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1077.eqiad.wmnet with reason: host reimage
[14:11:08] (CR) JHathaway: "thanks for fixing this, just a couple of questions" [puppet] - https://gerrit.wikimedia.org/r/823621 (https://phabricator.wikimedia.org/T315305) (owner: Jbond)
[14:13:40] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1077.eqiad.wmnet with reason: host reimage
[14:13:46] (PS2) Clément Goubert: mc2024: do not install redis [puppet] - https://gerrit.wikimedia.org/r/823650
[14:17:05] (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/823650 (owner: Clément Goubert)
[14:18:54] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (phaultfinder)
[14:20:38] (PS2) Jbond: C:ferm: update ferm to use restart-or-reload instead of restart [puppet] - https://gerrit.wikimedia.org/r/823621 
(https://phabricator.wikimedia.org/T315305)
[14:20:40] (PS1) Jbond: C:ferm: fix minor lint issues [puppet] - https://gerrit.wikimedia.org/r/823651
[14:21:43] (CR) Jbond: [C: +2] C:ferm: fix minor lint issues [puppet] - https://gerrit.wikimedia.org/r/823651 (owner: Jbond)
[14:22:21] (CR) Jbond: "thanks response inline" [puppet] - https://gerrit.wikimedia.org/r/823621 (https://phabricator.wikimedia.org/T315305) (owner: Jbond)
[14:23:17] RECOVERY - Host cp1089.mgmt is UP: PING WARNING - Packet loss = 33%, RTA = 2.10 ms
[14:23:36] (PS5) Ori: varnish::tests: add tests for query-sorting [puppet] - https://gerrit.wikimedia.org/r/822715 (https://phabricator.wikimedia.org/T138093)
[14:23:45] (CR) JHathaway: [C: +1] "looks good, thanks" [puppet] - https://gerrit.wikimedia.org/r/823621 (https://phabricator.wikimedia.org/T315305) (owner: Jbond)
[14:24:51] (CR) Giuseppe Lavagetto: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/823650 (owner: Clément Goubert)
[14:25:04] (CR) Ori: [C: +2] varnish::tests: add tests for query-sorting [puppet] - https://gerrit.wikimedia.org/r/822715 (https://phabricator.wikimedia.org/T138093) (owner: Ori)
[14:26:19] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route-jayme
[14:29:56] (CR) Jbond: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/823650 (owner: Clément Goubert)
[14:30:24] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1077.eqiad.wmnet with OS bullseye
[14:30:32] SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1077.eqiad.wmnet with OS bullseye completed: - elastic1070 (...
[14:31:23] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route-jayme (exit_code=0)
[14:31:36] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:32:20] (PS1) Cathal Mooney: admin: Add Purity Waigi to 'wmf' LDAP group and analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/823654 (https://phabricator.wikimedia.org/T315257)
[14:36:58] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms
[14:44:54] PROBLEM - Host cp1089.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[14:46:22] SRE, Data Engineering Planning, Data Pipelines: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (BTullis)
[14:52:08] RECOVERY - Host cp1089.mgmt is UP: PING OK - Packet loss = 0%, RTA = 2.10 ms
[15:04:26] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36758/console" [puppet] - https://gerrit.wikimedia.org/r/823650 (owner: Clément Goubert)
[15:05:32] (CR) Andrew Bogott: [C: +2] ceph: use many cluster and public networks [puppet] - https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: David Caro)
[15:05:36] (PS6) Andrew Bogott: ceph: use many cluster and public networks [puppet] - https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: David Caro)
[15:07:08] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route-jayme
[15:07:10] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route-jayme (exit_code=0)
[15:07:20] !log jayme@cumin1001 START - Cookbook sre.discovery.service-route-jayme
[15:07:26] (PS7) David Caro: ceph: use many cluster and public networks [puppet] - https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870)
[15:07:47] (CR) Giuseppe Lavagetto: [C: +2] mc2024: do not install redis [puppet] - 
https://gerrit.wikimedia.org/r/823650 (owner: Clément Goubert)
[15:09:49] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/823654 (https://phabricator.wikimedia.org/T315257) (owner: Cathal Mooney)
[15:10:21] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1074.eqiad.wmnet with OS bullseye
[15:10:28] SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1074.eqiad.wmnet with OS bullseye
[15:12:24] !log jayme@cumin1001 END (PASS) - Cookbook sre.discovery.service-route-jayme (exit_code=0)
[15:13:12] (PS1) JMeybohm: sre.discovery.service-route: Make the cookbook work [cookbooks] - https://gerrit.wikimedia.org/r/823659 (https://phabricator.wikimedia.org/T260663)
[15:13:38] (CR) Andrew Bogott: [C: +2] ceph: use many cluster and public networks [puppet] - https://gerrit.wikimedia.org/r/822407 (https://phabricator.wikimedia.org/T314870) (owner: David Caro)
[15:16:18] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Wed 24 Aug 2022 07:48:40 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:18:38] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 23 Oct 2022 06:50:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:19:06] RECOVERY - Aggregate IPsec Tunnel Status codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[15:22:44] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[15:23:08] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1074.eqiad.wmnet with reason: host reimage
[15:25:18] SRE, ops-eqiad, Infrastructure-Foundations, netops: Link from lsw1-e1-eqiad to lsw1-f2-eqiad down - https://phabricator.wikimedia.org/T315052 (Cmjohnson) @cmooney The QSFP28 module for et-o/o/54 on lsw1-f3-eqiad has been replaced.
[15:25:47] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1074.eqiad.wmnet with reason: host reimage
[15:26:35] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (Papaul) I found a decom HP server in storage i can pull the battery out of it and use it in 2032 or 2035. Your call
[15:27:45] (PS1) David Caro: ceph: use the correct network for eqiad [puppet] - https://gerrit.wikimedia.org/r/823664
[15:28:12] (CR) David Caro: [C: +2] ceph: use the correct network for eqiad [puppet] - https://gerrit.wikimedia.org/r/823664 (owner: David Caro)
[15:29:09] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be2032.codfw.wmnet with reason: RAID battery failure
[15:29:23] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be2032.codfw.wmnet with reason: RAID battery failure
[15:29:32] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=223def49-5d66-43ab-a0f2-4305a3e04e56) set by mvernon@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their services wit...
[15:29:54] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (MatthewVernon) Cool, thanks - put it in ms-be2032, please? 
I've just downtimed it and shut it down.
[15:33:12] (PS2) David Caro: ceph: rename CephOSDController to CephOSDNodeController [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823168
[15:33:14] (PS2) David Caro: global: add inventory module [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823169
[15:33:16] (PS1) David Caro: Openstack: use cluster_name instead of control node [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823666
[15:33:18] (PS1) David Caro: ceph: use cluster_name instead of control node [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823667
[15:33:20] (PS1) David Caro: ceph: use human-readable names for ceph clusters [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823668
[15:33:22] (PS1) David Caro: ceph: use the correct codfw ceph mon hosts [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823669
[15:33:24] (PS1) David Caro: ceph,opensatck: use the inventory to get the nodes domain [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823670
[15:33:26] (PS1) David Caro: ceph: add roll_restart_osd_daemons cookbook [cookbooks] (wmcs) - https://gerrit.wikimedia.org/r/823671
[15:35:14] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[15:35:24] (CR) Jaime Nuche: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1003/36720/" [puppet] - https://gerrit.wikimedia.org/r/820749 (owner: Jaime Nuche)
[15:37:28] jnuche: hey, around? 
I'm looking at the puppet window a little early as I'll have a meeting conflict :)
[15:38:25] jnuche: happy to talk about it, but my instinct is this is too big to be a good patch for the window, per https://wikitech.wikimedia.org/wiki/Puppet_request_window -- we should get you a regular review from serviceops rather than try to jam it into 30 minutes
[15:38:48] PROBLEM - Host ms-be2032.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:39:14] _joe_ might want first refusal (reviewing https://gerrit.wikimedia.org/r/820749) but if he doesn't have strong feelings I'm happy to give it a look later today
[15:40:53] SRE, ops-eqiad, Traffic: SSH on cp1089.mgmt is flapping - https://phabricator.wikimedia.org/T314951 (Cmjohnson) Open→Resolved replaced the cable
[15:40:58] SRE, Data-Engineering, Foundational Technology Requests: Add a webrequest sampled topic and ingest into druid/turnilo - https://phabricator.wikimedia.org/T314981 (Ottomata) I wanted to get a very stupid simple example of using Flink to sample webrequest in Kafka. Here's an example using purely strea...
[15:41:07] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (Papaul) ms-be2032 looks happy
[15:42:53] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1074.eqiad.wmnet with OS bullseye
[15:42:59] SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1074.eqiad.wmnet with OS bullseye completed: - elastic1070 (...
[15:44:34] RECOVERY - Host ms-be2032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.47 ms
[15:46:20] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (Papaul) Open→Resolved Resolving this task
[15:47:43] <_joe_> rzl: looking
[15:48:10] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Degraded RAID on ms-be2032 - https://phabricator.wikimedia.org/T314427 (MatthewVernon) Yes, all good in nagios now. Thank you :)
[15:48:15] !log mvernon@cumin1001 START - Cookbook sre.hosts.remove-downtime for ms-be2032.codfw.wmnet
[15:48:15] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2032.codfw.wmnet
[15:51:49] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (Papaul) I should be receiving a new disk sometimes today. If the new disk doesn't work then i will open a ticket with Dell.
[15:52:59] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (MatthewVernon) Thanks :)
[15:54:18] (CR) Giuseppe Lavagetto: [C: -1] "Overall the patch looks good, but two things to check:" [puppet] - https://gerrit.wikimedia.org/r/820749 (owner: Jaime Nuche)
[15:54:30] rzl, _joe_ I can assist with https://gerrit.wikimedia.org/r/820749 if jnuche isn't around
[15:55:10] <_joe_> dancy: I'm suggesting one simplification, but more importantly: that exec should live inside scap::master
[15:55:26] <_joe_> and I think one of the hiera changes isn't doing what you intended it to do
[15:55:26] OK. I'll leave that for Jaime to fix up.
[15:55:50] <_joe_> I'm happy to change stuff myself if jnuche doesn't have time to figure that out
[15:56:29] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T314997 (Cmjohnson) Open→Resolved a:Cmjohnson Re-balanced power a little, hopefully enough to stop the alert.
[15:57:37] _joe_, dancy: just came back, I'll take a look
[15:57:49] SRE, ops-eqiad, DC-Ops: ps1-e4-eqiad alerts - https://phabricator.wikimedia.org/T314027 (Cmjohnson) Open→Resolved pdu's are not fully setup yet
[15:57:51] SRE, ops-eqiad, DC-Ops: Q1: eqiad: (32) PDUs for expansion - https://phabricator.wikimedia.org/T290899 (Cmjohnson)
[15:57:54] PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:00:05] jbond and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220816T1600).
[16:00:05] jnuche: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:02:04] !log btullis@deploy1002 Started deploy [airflow-dags/analytics@3c998da]: (no justification provided)
[16:02:16] !log btullis@deploy1002 Finished deploy [airflow-dags/analytics@3c998da]: (no justification provided) (duration: 00m 12s)
[16:03:03] (PS1) Giuseppe Lavagetto: Add variables regulating the php 7.4 transition [mediawiki-config] - https://gerrit.wikimedia.org/r/823674 (https://phabricator.wikimedia.org/T271736)
[16:03:05] (PS1) Giuseppe Lavagetto: Move 0.1% of user traffic to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823675 (https://phabricator.wikimedia.org/T271736)
[16:03:07] (PS1) Giuseppe Lavagetto: Move 1% of traffic to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823676 (https://phabricator.wikimedia.org/T271736)
[16:03:09] (PS1) Giuseppe Lavagetto: Move 5% of traffic to php 7.4 [mediawiki-config] - https://gerrit.wikimedia.org/r/823677 (https://phabricator.wikimedia.org/T271736)
[16:03:11] (PS1) Giuseppe 
Lavagetto: Move 10% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823678 (https://phabricator.wikimedia.org/T271736) [16:03:13] (03PS1) 10Giuseppe Lavagetto: Move 1 of 6 users to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823679 (https://phabricator.wikimedia.org/T271736) [16:03:15] (03PS1) 10Giuseppe Lavagetto: Move 50% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823680 (https://phabricator.wikimedia.org/T271736) [16:03:17] (03PS1) 10Giuseppe Lavagetto: Move 100% of cookie-accepting clients to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823681 (https://phabricator.wikimedia.org/T271736) [16:03:25] (03CR) 10CI reject: [V: 04-1] Add variables regulating the php 7.4 transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823674 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [16:03:30] ooh exciting [16:03:34] (03CR) 10CI reject: [V: 04-1] Move 0.1% of user traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823675 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [16:03:49] (03CR) 10CI reject: [V: 04-1] Move 1% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823676 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [16:04:07] (03CR) 10CI reject: [V: 04-1] Move 5% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823677 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [16:04:11] silly jenkins [16:04:29] (03CR) 10CI reject: [V: 04-1] Move 10% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823678 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [16:04:35] (03CR) 10Giuseppe Lavagetto: [V: 03+1] kubernetes::mediawiki::releases: allow scap users to write releases files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/822610 (owner: 
10Giuseppe Lavagetto) [16:04:39] Weird pattern on s2 since around 13:27 https://grafana.wikimedia.org/goto/jFtahEi4z?orgId=1 [16:04:57] (03CR) 10CI reject: [V: 04-1] Move 1 of 6 users to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823679 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [16:05:30] (03CR) 10CI reject: [V: 04-1] Move 50% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823680 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [16:06:06] (03CR) 10CI reject: [V: 04-1] Move 100% of cookie-accepting clients to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823681 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [16:06:37] (03PS2) 10Giuseppe Lavagetto: kubernetes::mediawiki::releases: allow scap users to write releases files [puppet] - 10https://gerrit.wikimedia.org/r/822610 [16:08:23] (03PS2) 10Giuseppe Lavagetto: Add variables regulating the php 7.4 transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823674 (https://phabricator.wikimedia.org/T271736) [16:08:25] (03PS2) 10Giuseppe Lavagetto: Move 0.1% of user traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823675 (https://phabricator.wikimedia.org/T271736) [16:08:27] (03PS2) 10Giuseppe Lavagetto: Move 1% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823676 (https://phabricator.wikimedia.org/T271736) [16:08:29] (03PS2) 10Giuseppe Lavagetto: Move 5% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823677 (https://phabricator.wikimedia.org/T271736) [16:08:31] (03PS2) 10Giuseppe Lavagetto: Move 10% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823678 (https://phabricator.wikimedia.org/T271736) [16:08:33] (03PS2) 10Giuseppe Lavagetto: Move 1 of 6 users to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823679 
(https://phabricator.wikimedia.org/T271736) [16:08:35] (03PS2) 10Giuseppe Lavagetto: Move 50% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823680 (https://phabricator.wikimedia.org/T271736) [16:08:37] (03PS2) 10Giuseppe Lavagetto: Move 100% of cookie-accepting clients to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823681 (https://phabricator.wikimedia.org/T271736) [16:08:58] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:09:38] (03CR) 10CI reject: [V: 04-1] Add variables regulating the php 7.4 transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823674 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [16:09:40] (03CR) 10CI reject: [V: 04-1] Move 0.1% of user traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823675 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [16:09:44] (03CR) 10CI reject: [V: 04-1] Move 1% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823676 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [16:09:56] (03CR) 10Ahmon Dancy: kubernetes::mediawiki::releases: allow scap users to write releases files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/822610 (owner: 10Giuseppe Lavagetto) [16:10:01] (03CR) 10CI reject: [V: 04-1] Move 5% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823677 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [16:10:24] <_joe_> sigh sorry everyone [16:10:25] (03CR) 10CI reject: [V: 04-1] Move 10% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823678 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [16:10:29] (03PS3) 10Hnowlan: thumbor: new service chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/823143 (https://phabricator.wikimedia.org/T233196) [16:11:09] (03PS3) 10Giuseppe Lavagetto: Add variables 
regulating the php 7.4 transition [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823674 (https://phabricator.wikimedia.org/T271736) [16:11:11] (03PS3) 10Giuseppe Lavagetto: Move 0.1% of user traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823675 (https://phabricator.wikimedia.org/T271736) [16:11:13] (03PS3) 10Giuseppe Lavagetto: Move 1% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823676 (https://phabricator.wikimedia.org/T271736) [16:11:15] (03PS3) 10Giuseppe Lavagetto: Move 5% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823677 (https://phabricator.wikimedia.org/T271736) [16:11:17] (03PS3) 10Giuseppe Lavagetto: Move 10% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823678 (https://phabricator.wikimedia.org/T271736) [16:11:19] (03PS3) 10Giuseppe Lavagetto: Move 1 of 6 users to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823679 (https://phabricator.wikimedia.org/T271736) [16:11:21] (03PS3) 10Giuseppe Lavagetto: Move 50% of traffic to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823680 (https://phabricator.wikimedia.org/T271736) [16:11:23] (03PS3) 10Giuseppe Lavagetto: Move 100% of cookie-accepting clients to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823681 (https://phabricator.wikimedia.org/T271736) [16:11:25] (03CR) 10CI reject: [V: 04-1] Move 1 of 6 users to php 7.4 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823679 (https://phabricator.wikimedia.org/T271736) (owner: 10Giuseppe Lavagetto) [16:13:10] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:13:27] (03CR) 10Jaime Nuche: scap: introduce bootstrapping mechanism specific to deployment hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/820749 (owner: 10Jaime Nuche) [16:14:10] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): 
decommission cloudcontrol100[34] - https://phabricator.wikimedia.org/T313268 (10Cmjohnson) [16:14:25] 10SRE, 10ops-eqiad, 10decommission-hardware, 10Patch-For-Review, 10cloud-services-team (Kanban): decommission cloudcontrol100[34] - https://phabricator.wikimedia.org/T313268 (10Cmjohnson) 05Open→03Resolved [16:14:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudcontrol100[6-7].wikimedia.org - https://phabricator.wikimedia.org/T306853 (10Cmjohnson) [16:14:49] PROBLEM - IPMI Sensor Status on dbprov1002 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [16:15:42] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:(Need By: TBD) rack/setup/install db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T306928 (10Cmjohnson) Fixed the db1188 mgmt ip address [16:15:44] (03PS3) 10Giuseppe Lavagetto: kubernetes::mediawiki::releases: allow scap users to write releases files [puppet] - 10https://gerrit.wikimedia.org/r/822610 [16:16:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-radar: Decomission conf100[456] - https://phabricator.wikimedia.org/T311408 (10Cmjohnson) 05In progress→03Resolved done [16:16:19] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission labweb1001 and labweb1002 - https://phabricator.wikimedia.org/T313861 (10Cmjohnson) 05Open→03Resolved [16:16:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10Cmjohnson) [16:17:18] (03CR) 10Giuseppe Lavagetto: kubernetes::mediawiki::releases: allow scap users to write releases files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/822610 (owner: 10Giuseppe Lavagetto) [16:17:37] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host 
elastic1049.eqiad.wmnet with OS bullseye [16:17:44] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1049.eqiad.wmnet with OS bullseye [16:18:38] (03CR) 10Ahmon Dancy: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/822610 (owner: 10Giuseppe Lavagetto) [16:18:49] PROBLEM - Host db1188.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:18:49] PROBLEM - Host db1186.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:19:49] rzl, _joe_: I removed my patch from the window, I'll address _joe_'s comments and reschedule [16:20:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] kubernetes::mediawiki::releases: allow scap users to write releases files [puppet] - 10https://gerrit.wikimedia.org/r/822610 (owner: 10Giuseppe Lavagetto) [16:21:21] RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:25:27] (03PS6) 10Cwhite: logstash: duplicate alert logs for loki target [puppet] - 10https://gerrit.wikimedia.org/r/806349 (https://phabricator.wikimedia.org/T222826) [16:26:17] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [16:27:01] (03PS3) 10Majavah: puppetmaster: remove 'allow_from' [puppet] - 10https://gerrit.wikimedia.org/r/799859 [16:27:17] (03PS7) 10Cwhite: logstash: duplicate alert logs for loki target [puppet] - 10https://gerrit.wikimedia.org/r/806349 (https://phabricator.wikimedia.org/T222826) [16:27:26] (03PS7) 10Ori: Set expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 
(https://phabricator.wikimedia.org/T211661) (owner: 10Gilles) [16:28:19] (03CR) 10Majavah: puppetmaster: remove 'allow_from' (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799859 (owner: 10Majavah) [16:28:53] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T315344 (10phaultfinder) [16:28:59] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1049.eqiad.wmnet with reason: host reimage [16:30:31] RECOVERY - Host db1188.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.86 ms [16:31:17] (03CR) 10Cwhite: logstash: duplicate alert logs for loki target (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/806349 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [16:31:51] (03CR) 10Cwhite: [C: 03+2] logstash: duplicate alert logs for loki target [puppet] - 10https://gerrit.wikimedia.org/r/806349 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [16:33:12] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1049.eqiad.wmnet with reason: host reimage [16:33:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudcephosd10[25-34] Missing/unplugged hard drives - https://phabricator.wikimedia.org/T315221 (10Cmjohnson) I counted 10 disks per server, I spot checked cloudcephosd1025 all 10 disks show up in the raid Raid 1 is the 2 smaller disks Virt... [16:35:43] RECOVERY - Host db1186.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [16:36:07] jnuche: thanks! probably no need to reschedule in another window, we can just merge it in the regular review process, unless you want to make sure to be around to test it... but if you find _joe_ and I aren't getting back to you, feel free :) [16:36:17] PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 429 Too Many Requests - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 754 bytes in 1.051 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [16:36:18] not that that's ever happened before, in the entire history of code reviews, obviously [16:36:53] rzl: hehehe, ack [16:39:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): cloudcephosd10[25-34] Missing/unplugged hard drives - https://phabricator.wikimedia.org/T315221 (10dcaro) We are using the hardware raid1 and one of the 1.8T raid0 disks as software raid1 :), from the lsblk: ` ... sda 8:0 0 446... [16:45:35] (03CR) 10Andrea Denisse: netmon: Set correct owner for the LibreNMS rrd directory. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/822204 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [16:45:59] (03Abandoned) 10Andrea Denisse: netmon: Set correct owner for the LibreNMS rrd directory. [puppet] - 10https://gerrit.wikimedia.org/r/822204 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [20:31:09] 10SRE, 10MediaWiki-General, 10Traffic, 10MW-1.39-notes (1.39.0-wmf.23; 2022-08-01), 10Patch-For-Review: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868 (10ori) [20:31:22] (03PS3) 10Clare Ming: Update sticky header config for idwiki, viwiki A/B experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823268 (https://phabricator.wikimedia.org/T312295) [20:33:21] (03PS1) 10Ottomata: airflow-dags/platform_eng - fix typo in scap source [puppet] - 10https://gerrit.wikimedia.org/r/823727 (https://phabricator.wikimedia.org/T312858) [20:34:49] dancy: quick Q - for my patch 823268 - i forgot to rebase before running scap backport -- now it seems to be stuck [20:35:03] stuck waiting for it to be merged? 
[20:35:12] stuck rebasing i think [20:35:12] (03PS2) 10Ottomata: airflow-dags/platform_eng - fix typo in scap source [puppet] - 10https://gerrit.wikimedia.org/r/823727 (https://phabricator.wikimedia.org/T312858) [20:35:38] (03CR) 10Ottomata: [V: 03+2 C: 03+2] airflow-dags/platform_eng - fix typo in scap source [puppet] - 10https://gerrit.wikimedia.org/r/823727 (https://phabricator.wikimedia.org/T312858) (owner: 10Ottomata) [20:35:43] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1055.eqiad.wmnet with OS bullseye [20:35:49] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1055.eqiad.wmnet with OS bullseye completed: - elastic1055 (... [20:36:11] cjming: https://integration.wikimedia.org/zuul/#q=823268 shows that it is still undergoing post+2 checks [20:36:14] and it just finished [20:36:16] i ran scap backport, it borked, then i rebased, and removed TrainBranchBot's +2 -- oh, it appears to be going again [20:36:40] yup - I'll try again [20:37:01] (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823268 (https://phabricator.wikimedia.org/T312295) (owner: 10Clare Ming) [20:38:38] (03Merged) 10jenkins-bot: Update sticky header config for idwiki, viwiki A/B experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823268 (https://phabricator.wikimedia.org/T312295) (owner: 10Clare Ming) [20:39:12] !log cjming@deploy1002 Started scap: Backport for [[gerrit:823268]] Update sticky header config for idwiki, viwiki A/B experiment [20:39:52] !log otto@deploy1002 Started deploy [airflow-dags/platform_eng@eba3ff8]: initial scap deploy to an-airflow1004 - T312858 [20:39:55] T312858: New airflow instance related to Image Suggestion Jobs - 
https://phabricator.wikimedia.org/T312858 [20:40:10] (03PS1) 10Dzahn: Revert "Revert "site: add phabricator role to phab2002"" [puppet] - 10https://gerrit.wikimedia.org/r/823636 [20:42:22] !log otto@deploy1002 Finished deploy [airflow-dags/platform_eng@eba3ff8]: initial scap deploy to an-airflow1004 - T312858 (duration: 02m 30s) [20:43:08] PROBLEM - Check systemd state on an-airflow1004 is CRITICAL: CRITICAL - degraded: The following units failed: ifup@ens13.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:45:56] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:823268]] Update sticky header config for idwiki, viwiki A/B experiment (duration: 06m 44s) [20:47:35] 10SRE, 10Performance-Team, 10observability: Add monitoring for performance.wikimedia.org - https://phabricator.wikimedia.org/T277927 (10Dzahn) Since observability wants to move all Icinga checks to Prometheus/Alertmanager there is probably little point in trying to add an Icinga virtual host `performance.wik... [20:47:57] !log end of UTC late backport window [20:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:02] jouncebot nowandnext [20:50:02] For the next 0 hour(s) and 9 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220816T2000) [20:50:03] In 10 hour(s) and 9 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220817T0700) [20:50:20] OK. Rolling wmf.25 to testwikis again. [20:50:25] 10SRE, 10Performance-Team, 10observability: Add monitoring for performance.wikimedia.org - https://phabricator.wikimedia.org/T277927 (10Dzahn) The first thing needed here is to define a "receiver". This will define what will happen if an alert triggers. Actions can be "send email", "create phab ticket", "no... 
[20:52:07] (03PS1) 10TrainBranchBot: testwikis wikis to 1.39.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823732 (https://phabricator.wikimedia.org/T314186) [20:52:09] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.39.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823732 (https://phabricator.wikimedia.org/T314186) (owner: 10TrainBranchBot) [20:53:00] !log otto@deploy1002 Started deploy [airflow-dags/platform_eng@da511ee]: initial scap deploy to an-airflow1004, take 2 - T312858 [20:53:03] T312858: New airflow instance related to Image Suggestion Jobs - https://phabricator.wikimedia.org/T312858 [20:53:05] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823732 (https://phabricator.wikimedia.org/T314186) (owner: 10TrainBranchBot) [20:53:55] !log dancy@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.25 refs T314186 [20:53:59] T314186: 1.39.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T314186 [20:54:05] !log otto@deploy1002 Finished deploy [airflow-dags/platform_eng@da511ee]: initial scap deploy to an-airflow1004, take 2 - T312858 (duration: 01m 05s) [20:54:32] 10SRE, 10Performance-Team, 10observability: Add monitoring for performance.wikimedia.org - https://phabricator.wikimedia.org/T277927 (10Dzahn) Turns out you already have this: ` - name: 'perf-ircmail' webhook_configs: - url: 'http://<%= @active_host %>:19190/wikimedia-perf-bots' email_configs:... [20:56:44] 10Puppet, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Traffic: Evaluation Error on deployment-cache-text06 puppet run - https://phabricator.wikimedia.org/T315351 (10RhinosF1) p:05Triage→03Unbreak! Hi Traffic, this might be stopping beta coming back up (or a false alarm). Can you take... 
[21:01:57] !log dancy@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.25 refs T314186 (duration: 08m 02s) [21:02:01] T314186: 1.39.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T314186 [21:05:37] !log otto@deploy1002 Started deploy [airflow-dags/platform_eng@33afb85]: initial scap deploy to an-airflow1004, take 3 - T312858 [21:05:41] T312858: New airflow instance related to Image Suggestion Jobs - https://phabricator.wikimedia.org/T312858 [21:05:56] !log otto@deploy1002 Finished deploy [airflow-dags/platform_eng@33afb85]: initial scap deploy to an-airflow1004, take 3 - T312858 (duration: 00m 18s) [21:10:39] 10SRE, 10Scap: Deploy error: insufficient permission for adding an object to repository database .git/objects - https://phabricator.wikimedia.org/T187076 (10dancy) 05Open→03Resolved a:03dancy Closing due to age. [21:12:00] (03PS1) 10Dzahn: webperf: add prometheus::blackbox::check::http for performance.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/823737 (https://phabricator.wikimedia.org/T277927) [21:12:42] PROBLEM - ElasticSearch numbers of masters eligible - 9643 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [21:13:10] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic1075.eqiad.wmnet with OS bullseye [21:13:16] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host elastic1075.eqiad.wmnet with OS bullseye [21:15:49] (03PS2) 10Aklapper: Disable journald messages' rate limiting [debs/pybal] - 10https://gerrit.wikimedia.org/r/418866 (https://phabricator.wikimedia.org/T189290) (owner: 10Vgutierrez) [21:16:14] 10SRE, 10PyBal, 10Traffic-Icebox, 10Patch-For-Review: Tune systemd journal rate limiting for PyBal - https://phabricator.wikimedia.org/T189290 (10Aklapper) 05Stalled→03Open [21:17:28] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: kubernetes202[01] implementation tracking - https://phabricator.wikimedia.org/T313871 (10Papaul) @akosiaris we have already kubernetes202[01] so we have to use kubernetes202[34] Thanks [21:18:38] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes202[34] - https://phabricator.wikimedia.org/T313870 (10Papaul) [21:25:35] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [21:25:55] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1075.eqiad.wmnet with reason: host reimage [21:26:06] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1025.eqiad.wmnet with OS bullseye [21:26:23] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host graphite2004 [21:27:13] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host graphite2004 [21:27:58] (03PS1) 10Bartosz Dziewoński: Make DiscussionTools replytool, newtopictool opt-out on ruwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/823746 (https://phabricator.wikimedia.org/T297410) [21:28:35] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1075.eqiad.wmnet with reason: host reimage [21:29:19] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:29:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host graphite2004.mgmt.codfw.wmnet with reboot policy FORCED [21:31:23] !log bking@cumin1001 START - Cookbook sre.hosts.decommission for hosts elastic1048.eqiad.wmnet [21:34:45] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T315352 (10wiki_willy) a:03Cmjohnson [21:35:46] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T314998 (10wiki_willy) a:03Cmjohnson [21:36:43] 10ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T315344 (10wiki_willy) a:03Cmjohnson [21:37:41] (03PS1) 10Bking: elastic: decom elastic1048 [puppet] - 10https://gerrit.wikimedia.org/r/823747 (https://phabricator.wikimedia.org/T309810) [21:39:41] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/823747 (https://phabricator.wikimedia.org/T309810) (owner: 10Bking) [21:39:49] (03PS1) 10Andrea Denisse: quickdatacopy: Added simple username/groupname mapping for the Rsync server [puppet] - 10https://gerrit.wikimedia.org/r/823748 (https://phabricator.wikimedia.org/T314972) [21:40:07] (03CR) 10Ryan Kemper: [C: 03+1] elastic: decom elastic1048 [puppet] - 10https://gerrit.wikimedia.org/r/823747 (https://phabricator.wikimedia.org/T309810) (owner: 10Bking) [21:41:00] (03CR) 10Bking: [C: 03+2] elastic: decom elastic1048 [puppet] - 10https://gerrit.wikimedia.org/r/823747 (https://phabricator.wikimedia.org/T309810) (owner: 10Bking) [21:41:50] RECOVERY - ElasticSearch numbers of masters eligible - 9643 on search.svc.eqiad.wmnet is OK: OK - All good 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [21:41:58] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=99) for hosts elastic1048.eqiad.wmnet [21:42:00] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:42:06] !log bking@cumin1001 START - Cookbook sre.hosts.decommission for hosts elastic1048.eqiad.wmnet [21:44:43] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1025.eqiad.wmnet with OS bullseye [21:44:56] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1025.eqiad.wmnet with OS bullseye [21:45:02] (03PS1) 10Bartosz Dziewoński: Make DiscussionTools topicsubscription opt-out on A/B test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823749 (https://phabricator.wikimedia.org/T314693) [21:45:03] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1025.eqiad.wmnet with OS bullseye [21:45:22] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1025.eqiad.wmnet with OS bullseye [21:45:32] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1075.eqiad.wmnet with OS bullseye [21:45:40] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host elastic1075.eqiad.wmnet with OS bullseye completed: - elastic1055 (... 
[21:46:46] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:48] (03PS1) 10TrainBranchBot: group0 wikis to 1.39.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823751 (https://phabricator.wikimedia.org/T314186) [21:46:50] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.39.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823751 (https://phabricator.wikimedia.org/T314186) (owner: 10TrainBranchBot) [21:47:18] !log bking@cumin1001 START - Cookbook sre.dns.netbox [21:49:53] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823751 (https://phabricator.wikimedia.org/T314186) (owner: 10TrainBranchBot) [21:53:18] !log bking@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:53:19] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1048.eqiad.wmnet [21:53:38] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1025.eqiad.wmnet with OS bullseye [21:53:52] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1025.eqiad.wmnet with OS bullseye [21:53:59] !log demon@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.25 refs T314186 [21:54:02] T314186: 1.39.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T314186 [21:56:23] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1025.eqiad.wmnet with OS bullseye [21:56:33] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1025.eqiad.wmnet with OS bullseye [21:59:09] (03PS1) 10Andrea Denisse: netmon: Set correct username/groupname mappings for LibreNMS [puppet] - 10https://gerrit.wikimedia.org/r/823752 (https://phabricator.wikimedia.org/T314972) 
[21:59:34] <^demon> `MWException: Error contacting the Parsoid/RESTBase server (HTTP 404)` Hmmm? [22:00:27] <^demon> (from /rpc/RunSingleJob.php) [22:04:12] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10aaron) [22:04:32] you'll need to provide some more details [22:04:53] but my guess would be that the job simply referred to a page that has been deleted in the meantime [22:05:00] (03PS1) 10Dzahn: phabricator: move lvs::realserver inclusion to profile, depend on vcs_enabled [puppet] - 10https://gerrit.wikimedia.org/r/823755 (https://phabricator.wikimedia.org/T280597) [22:07:00] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1025.eqiad.wmnet with reason: host reimage [22:07:05] dunno if wikibugs is being a bit slow, but T315383 [22:07:05] T315383: MWException: Error contacting the Parsoid/RESTBase server (HTTP 404) - https://phabricator.wikimedia.org/T315383 [22:10:27] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1025.eqiad.wmnet with reason: host reimage [22:11:20] (03PS1) 10Bartosz Dziewoński: Enable visual editor in Project: (Wikipedia:) namespace at viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823757 (https://phabricator.wikimedia.org/T314968) [22:12:25] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:12:33] (03PS2) 10Bartosz Dziewoński: Enable visual editor in Project: (Wikipedia:) namespace on viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823757 (https://phabricator.wikimedia.org/T314968) [22:12:56] (03PS1) 10Bartosz Dziewoński: Enable wgCiteResponsiveReferences on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823758 [22:13:16] (03PS2) 10Bartosz Dziewoński: Enable 
wgCiteResponsiveReferences on plwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/823758 (https://phabricator.wikimedia.org/T315333) [22:14:00] (03CR) 10Dzahn: "should be noop but does have this diff.. which should just be the order of packages:" [puppet] - 10https://gerrit.wikimedia.org/r/823755 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:15:37] (03CR) 10Dzahn: [C: 04-2] "first https://gerrit.wikimedia.org/r/c/operations/puppet/+/823755" [puppet] - 10https://gerrit.wikimedia.org/r/823636 (owner: 10Dzahn) [22:15:37] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:16:06] (03PS1) 10Andrea Denisse: netmon: Set correct username/groupname mappings for Rancid [puppet] - 10https://gerrit.wikimedia.org/r/823759 (https://phabricator.wikimedia.org/T314972) [22:16:52] (03CR) 10Brennen Bearnes: [C: 03+1] "_Seems_ reasonable." 
[puppet] - 10https://gerrit.wikimedia.org/r/823755 (https://phabricator.wikimedia.org/T280597) (owner: 10Dzahn) [22:16:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Large deletions affecting this replica [22:17:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Large deletions affecting this replica [22:20:23] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1002/36769/" [puppet] - 10https://gerrit.wikimedia.org/r/823759 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [22:21:15] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1001/36767/" [puppet] - 10https://gerrit.wikimedia.org/r/823752 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [22:21:43] (03CR) 10Andrea Denisse: "PCC Results: https://puppet-compiler.wmflabs.org/pcc-worker1003/36766/" [puppet] - 10https://gerrit.wikimedia.org/r/823748 (https://phabricator.wikimedia.org/T314972) (owner: 10Andrea Denisse) [22:29:50] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1025.eqiad.wmnet with OS bullseye [22:30:03] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1025.eqiad.wmnet with OS bullseye [22:31:55] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1025.eqiad.wmnet with OS bullseye [22:31:58] PROBLEM - Check systemd state on search-loader2001 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:34:38] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. 
- https://phabricator.wikimedia.org/T315388 (10andrea.denisse) [22:36:54] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops, and 2 others: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (10andrea.denisse) Hello team, I submitted the following patches for this issue: 1. [[ https... [22:37:36] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (10andrea.denisse) [22:37:43] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops, and 2 others: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (10andrea.denisse) [22:42:48] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1025.eqiad.wmnet with OS bullseye [22:45:06] RECOVERY - Check systemd state on search-loader2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:46:57] (03PS2) 10Andrea Denisse: quickdatacopy: Added simple username/groupname mapping for the Rsync server [puppet] - 10https://gerrit.wikimedia.org/r/823748 (https://phabricator.wikimedia.org/T314972) [22:47:14] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1025.eqiad.wmnet with OS bullseye [22:47:15] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcephosd1026.eqiad.wmnet with OS bullseye [22:47:27] (03PS1) 10Zabe: role::puppetmaster::standalone: Remove apache2 standard ports [puppet] - 10https://gerrit.wikimedia.org/r/823762 [22:47:59] (03PS2) 10Zabe: role::puppetmaster::standalone: Remove apache2 standard ports [puppet] - 10https://gerrit.wikimedia.org/r/823762 [22:48:53] (03CR) 10CI reject: [V: 04-1] role::puppetmaster::standalone: 
Remove apache2 standard ports [puppet] - 10https://gerrit.wikimedia.org/r/823762 (owner: 10Zabe) [22:50:32] (03PS3) 10Andrea Denisse: quickdatacopy: Added simple username/groupname mapping for the Rsync server [puppet] - 10https://gerrit.wikimedia.org/r/823748 (https://phabricator.wikimedia.org/T314972) [22:50:34] (03PS3) 10Zabe: role::puppetmaster::standalone: Remove apache2 standard ports [puppet] - 10https://gerrit.wikimedia.org/r/823762 [22:52:24] PROBLEM - Check systemd state on elastic1079 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:59:20] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1025.eqiad.wmnet with reason: host reimage [22:59:49] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1026.eqiad.wmnet with reason: host reimage [23:01:48] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host graphite2004.mgmt.codfw.wmnet with reboot policy FORCED [23:02:57] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1025.eqiad.wmnet with reason: host reimage [23:04:31] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (10Dzahn) @andrea.denisse Also see https://wikitech.wikimedia.org/wiki/UID [23:04:44] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1026.eqiad.wmnet with reason: host reimage [23:07:01] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 7 day(s) (Wed 24 Aug 2022 07:48:40 AM GMT +0000). 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:08:07] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 23 Oct 2022 06:50:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:09:39] ^ we already have a ticket or 2 about that [23:19:23] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1026.eqiad.wmnet with OS bullseye [23:20:21] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [23:20:30] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['graphite2004'] [23:21:14] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['graphite2004'] [23:22:57] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['graphite2004'] [23:23:03] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (10Dzahn) @andrea.denisse I ran into this a bunch of times before. If you want to "reserve" a UID but also... 
[23:23:49] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['graphite2004'] [23:24:34] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:26:27] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host kafka-logging2004 [23:27:04] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-logging2004 [23:27:24] (03PS1) 10Ori: BETA CLUSTER: Revert "trafficserver: 9.x upgrade: install ATS 9.x from component" [puppet] - 10https://gerrit.wikimedia.org/r/823638 [23:27:37] !log pt1979@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host kafka-logging2005 [23:28:08] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kafka-logging2005 [23:28:14] (03CR) 10CI reject: [V: 04-1] BETA CLUSTER: Revert "trafficserver: 9.x upgrade: install ATS 9.x from component" [puppet] - 10https://gerrit.wikimedia.org/r/823638 (owner: 10Ori) [23:28:48] 10Puppet, 10SRE, 10Beta-Cluster-Infrastructure, 10Infrastructure-Foundations, 10Traffic: Evaluation Error on deployment-cache-text06 puppet run - https://phabricator.wikimedia.org/T315351 (10TheresNoTime) Introduced by https://gerrit.wikimedia.org/r/c/operations/puppet/+/816806 ? `lang=diff diff --git a... 
[23:31:05] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['graphite2004'] [23:31:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['graphite2004'] [23:32:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kafka-logging2004.mgmt.codfw.wmnet with reboot policy FORCED [23:33:55] 10SRE, 10ops-codfw, 10DC-Ops, 10observability: Q1:rack/setup/install graphite2004 - https://phabricator.wikimedia.org/T313851 (10Papaul) [23:37:12] (03PS1) 10Ori: BETA CLUSTER: Revert "esitest service for cache nodes" [puppet] - 10https://gerrit.wikimedia.org/r/823639 [23:37:44] !log phab2002 - chown -R phd:www-data /srv/repos/ (because of UID mismatch) T313360 [23:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:37:48] T313360: Setup rsync for phab data on disk - https://phabricator.wikimedia.org/T313360 [23:39:46] (03CR) 10CI reject: [V: 04-1] BETA CLUSTER: Revert "esitest service for cache nodes" [puppet] - 10https://gerrit.wikimedia.org/r/823639 (owner: 10Ori) [23:40:16] (03CR) 10Zabe: "has been cherry-picked to beta cluster in order to make puppet run on puppetmaster again" [puppet] - 10https://gerrit.wikimedia.org/r/823762 (owner: 10Zabe) [23:44:17] !log phab1001 - repeated rsync of /srv/repos to phab2002, then chown -R phd /srv/repos/ (without setting the group) - this way UID is fixed and privs match exactly phab1001 - T313360 [23:44:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:44:21] T313360: Setup rsync for phab data on disk - https://phabricator.wikimedia.org/T313360 [23:49:49] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. 
- https://phabricator.wikimedia.org/T315388 (10Dzahn) https://gerrit.wikimedia.org/r/c/operations/puppet/%2B/666133/5/modules/admin/data/data.yaml [23:53:05] (03PS1) 10Andrea Denisse: netmon: Create LibreNMS logs file. [puppet] - 10https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T309074) [23:53:37] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kafka-logging2004.mgmt.codfw.wmnet with reboot policy FORCED [23:53:42] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:53:52] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:53:59] (03CR) 10CI reject: [V: 04-1] netmon: Create LibreNMS logs file. [puppet] - 10https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T309074) (owner: 10Andrea Denisse) [23:55:21] (03PS1) 10Dzahn: phabricator::migration: add phd user with sysmted::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/823765 (https://phabricator.wikimedia.org/T313360) [23:55:50] (03PS2) 10Dzahn: phabricator::migration: add phd user with systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/823765 (https://phabricator.wikimedia.org/T313360) [23:56:34] (03CR) 10CI reject: [V: 04-1] phabricator::migration: add phd user with systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/823765 (https://phabricator.wikimedia.org/T313360) (owner: 10Dzahn) [23:56:36] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kafka-logging2005.mgmt.codfw.wmnet with reboot policy FORCED [23:56:42] (03PS2) 10Andrea Denisse: netmon: Create LibreNMS logs file. 
[puppet] - 10https://gerrit.wikimedia.org/r/823764 (https://phabricator.wikimedia.org/T309074) [23:57:27] (03PS1) 10Ori: beta cluster: don't instantiate ::esitest [puppet] - 10https://gerrit.wikimedia.org/r/823766 (https://phabricator.wikimedia.org/T315350) [23:59:27] (03PS1) 10Dzahn: phabricator: replace user{} with systemd::sysuser for daemon user [puppet] - 10https://gerrit.wikimedia.org/r/823767 (https://phabricator.wikimedia.org/T313360)