[00:22:20] SRE, Privacy Engineering, Research, Security-Team, and 3 others: wikiworkshop.org has Facebook button, external statcounter, https to http redirect - https://phabricator.wikimedia.org/T251732 (bmansurov) Open→Resolved It seems the majority of the issues described in the task have been res...
[00:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:43:02] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220607T0100)
[01:16:29] (PS1) Papaul: Testing partman for clouddumps node [puppet] - https://gerrit.wikimedia.org/r/803373 (https://phabricator.wikimedia.org/T302981)
[01:18:18] (CR) Papaul: [C: +2] Testing partman for clouddumps node [puppet] - https://gerrit.wikimedia.org/r/803373 (https://phabricator.wikimedia.org/T302981) (owner: Papaul)
[01:24:06] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 26802249752 and 54514 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:32:32] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on clouddumps1001.wikimedia.org with reason: host reimage
[01:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:49] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on clouddumps1001.wikimedia.org
with reason: host reimage
[01:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:45:10] SRE, Privacy Engineering, Research, Security-Team, and 4 others: wikiworkshop.org has Facebook button, external statcounter, https to http redirect - https://phabricator.wikimedia.org/T251732 (sbassett)
[01:48:36] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host clouddumps1001.wikimedia.org with OS bullseye
[01:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:48:47] SRE, ops-eqiad, DC-Ops, Infrastructure-Foundations, and 3 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host clouddumps1001.wikimedia.org with...
[02:06:07] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[02:06:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:17] (PS1) TrainBranchBot: Branch commit for wmf/1.39.0-wmf.15 [core] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/803382
[02:07:21] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.39.0-wmf.15 [core] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/803382 (owner: TrainBranchBot)
[02:07:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:07:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:09:52] !log mwdebug-deploy@deploy1002 helmfile [codfw]
DONE helmfile.d/services/mwdebug: apply
[02:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:26:00] (Merged) jenkins-bot: Branch commit for wmf/1.39.0-wmf.15 [core] (wmf/1.39.0-wmf.15) - https://gerrit.wikimedia.org/r/803382 (owner: TrainBranchBot)
[02:30:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[02:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[02:30:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[02:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:31:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[02:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:50:43] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 77396606976 and 1850 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:51:55] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 79585981968 and 1921 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:05:05] RECOVERY - SSH on druid1006.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:07:23] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:32:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298560)', diff saved to https://phabricator.wikimedia.org/P29444 and previous
config saved to /var/cache/conftool/dbconfig/20220607-033206-ladsgroup.json
[03:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:32:11] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[03:33:41] PROBLEM - Host mr1-drmrs is DOWN: PING CRITICAL - Packet loss = 100%
[03:36:11] PROBLEM - Host mr1-drmrs.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[03:36:11] PROBLEM - Host mr1-drmrs IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[03:40:05] RECOVERY - Host mr1-drmrs is UP: PING WARNING - Packet loss = 60%, RTA = 87.78 ms
[03:40:46] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:42:33] RECOVERY - Host mr1-drmrs.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 278.02 ms
[03:42:33] RECOVERY - Host mr1-drmrs IPv6 is UP: PING OK - Packet loss = 0%, RTA = 87.49 ms
[03:45:46] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@drmrs - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:47:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P29445 and previous config saved to /var/cache/conftool/dbconfig/20220607-034711-ladsgroup.json
[03:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:50:57] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:02:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after
maintenance db1143', diff saved to https://phabricator.wikimedia.org/P29446 and previous config saved to /var/cache/conftool/dbconfig/20220607-040216-ladsgroup.json
[04:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:05:41] (CR) KartikMistry: Update cxserver to 2022-05-31-123738-production (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: KartikMistry)
[04:17:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298560)', diff saved to https://phabricator.wikimedia.org/P29447 and previous config saved to /var/cache/conftool/dbconfig/20220607-041721-ladsgroup.json
[04:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:17:25] T298560: Fix mismatching field type of revision.rev_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298560
[04:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:01:48] (PS1) Marostegui: Revert "es2031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/802961
[05:08:48] (CR) Marostegui: [C: +2] Revert "es2031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/802961 (owner: Marostegui)
[05:14:18] (CR) Marostegui: "Thanks Alex" [deployment-charts] - https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: KartikMistry)
[05:15:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool es2031 T309977', diff saved to https://phabricator.wikimedia.org/P29449 and previous config saved to /var/cache/conftool/dbconfig/20220607-051525-marostegui.json
[05:15:30]
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:15:31] T309977: es2031 crashed (es2) - https://phabricator.wikimedia.org/T309977
[05:25:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[05:25:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[05:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T310011)', diff saved to https://phabricator.wikimedia.org/P29450 and previous config saved to /var/cache/conftool/dbconfig/20220607-052522-marostegui.json
[05:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:26] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[05:28:39] (PS1) Marostegui: change_cuc_timestamp_T310011.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/803386 (https://phabricator.wikimedia.org/T310011)
[05:29:56] (CR) Marostegui: [C: +2] change_cuc_timestamp_T310011.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/803386 (https://phabricator.wikimedia.org/T310011) (owner: Marostegui)
[05:30:17] PROBLEM - ElasticSearch unassigned shard check - 9200 on cloudelastic1001 is CRITICAL: CRITICAL - commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z), commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:30:21] (Merged) jenkins-bot: change_cuc_timestamp_T310011.py: New schema change [software/schema-changes] - https://gerrit.wikimedia.org/r/803386
(https://phabricator.wikimedia.org/T310011) (owner: Marostegui)
[05:30:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T310011)', diff saved to https://phabricator.wikimedia.org/P29451 and previous config saved to /var/cache/conftool/dbconfig/20220607-053039-marostegui.json
[05:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:30:45] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[05:32:01] (PS1) Elukey: role::prometheus: fix port declaration for ml-staging k8s [puppet] - https://gerrit.wikimedia.org/r/803387
[05:33:13] (CR) Elukey: [C: +2] role::prometheus: fix port declaration for ml-staging k8s [puppet] - https://gerrit.wikimedia.org/r/803387 (owner: Elukey)
[05:38:51] PROBLEM - ElasticSearch unassigned shard check - 9200 on cloudelastic1006 is CRITICAL: CRITICAL - commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z), commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:45:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P29452 and previous config saved to /var/cache/conftool/dbconfig/20220607-054544-marostegui.json
[05:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:49:13] PROBLEM - ElasticSearch unassigned shard check - 9200 on cloudelastic1002 is CRITICAL: CRITICAL - commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z), commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[05:51:21] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:54:57] PROBLEM - ElasticSearch unassigned shard check - 9200 on
cloudelastic1005 is CRITICAL: CRITICAL - commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z), commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:00:05] kormat, marostegui, and Amir1: (Dis)respected human, time to deploy Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220607T0600). Please do the needful.
[06:00:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P29453 and previous config saved to /var/cache/conftool/dbconfig/20220607-060049-marostegui.json
[06:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T310011)', diff saved to https://phabricator.wikimedia.org/P29454 and previous config saved to /var/cache/conftool/dbconfig/20220607-061554-marostegui.json
[06:15:56] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[06:15:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[06:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:59] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[06:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T310011)', diff saved to https://phabricator.wikimedia.org/P29455 and previous config saved to /var/cache/conftool/dbconfig/20220607-061602-marostegui.json
[06:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:06] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:17:19] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:19:37] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[06:21:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T310011)', diff saved to https://phabricator.wikimedia.org/P29456 and previous config saved to /var/cache/conftool/dbconfig/20220607-062120-marostegui.json
[06:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:25] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[06:25:33] (Abandoned) Mabualruz: Remove 6 deprecated ResourceLoader skin modules in core [mediawiki-config] - https://gerrit.wikimedia.org/r/802578 (https://phabricator.wikimedia.org/T304322) (owner: Mabualruz)
[06:33:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:36:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling
after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P29457 and previous config saved to /var/cache/conftool/dbconfig/20220607-063625-marostegui.json
[06:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:29] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2009 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[06:48:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance elastic1080-production-search-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:49:17] PROBLEM - ElasticSearch unassigned shard check - 9200 on cloudelastic1003 is CRITICAL: CRITICAL - commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z), commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:49:55] (PS2) WMDE-Fisch: Enable tag on remaining FlaggedRevision page-stabilized wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/802732 (https://phabricator.wikimedia.org/T307348)
[06:50:11] (PS10) Slyngshede: Rewrite logster::job to use systemd timers.
[puppet] - https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673)
[06:51:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P29458 and previous config saved to /var/cache/conftool/dbconfig/20220607-065131-marostegui.json
[06:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:41] (CR) Slyngshede: [C: +2] Rewrite logster::job to use systemd timers. [puppet] - https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[06:54:51] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:55:01] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:00:04] Amir1 and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220607T0700).
[07:00:04] WMDE-Fisch and koi: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:37] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:00:43] Hi!
[07:00:58] \o
[07:01:27] koi: You can go first, I still have to wait for a review by my team -.-
[07:02:29] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[07:03:00] Thanks!
Waiting for someone could deploy today
[07:03:29] SRE, ops-codfw, DBA: es2031 crashed (es2) - https://phabricator.wikimedia.org/T309977 (Marostegui) Open→Resolved Closing as fixed, we'll reopen if it crashes again.
[07:03:34] (PS1) Muehlenhoff: Record LDAP access for evza [puppet] - https://gerrit.wikimedia.org/r/803390 (https://phabricator.wikimedia.org/T309700)
[07:04:32] (CR) Slyngshede: [V: +1 C: +2] Move restart of slapd, due to memory leaks, to systemd timers. [puppet] - https://gerrit.wikimedia.org/r/792109 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:05:58] koi: Ah sure, I could do that. If nobody shows up :-).
[07:06:04] I'll do it.
[07:06:22] (PS2) WMDE-Fisch: Revert "votewiki: Change wgLanguageCode to zh for May 2022 zhwiki admin election" [mediawiki-config] - https://gerrit.wikimedia.org/r/802833 (owner: Stang)
[07:06:33] (CR) Awight: [C: +1] "thrilling!" [mediawiki-config] - https://gerrit.wikimedia.org/r/802732 (https://phabricator.wikimedia.org/T307348) (owner: WMDE-Fisch)
[07:06:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T310011)', diff saved to https://phabricator.wikimedia.org/P29460 and previous config saved to /var/cache/conftool/dbconfig/20220607-070637-marostegui.json
[07:06:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[07:06:41] Thanks a lot!
[07:06:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[07:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:42] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[07:06:42] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[07:06:44] (CR) Slyngshede: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/802736 (https://phabricator.wikimedia.org/T309786) (owner: David Caro)
[07:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[07:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:48] Change seems to make sense, looking at the ticket and the revert. :-)
[07:06:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T310011)', diff saved to https://phabricator.wikimedia.org/P29461 and previous config saved to /var/cache/conftool/dbconfig/20220607-070650-marostegui.json
[07:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:53] (CR) WMDE-Fisch: [C: +2] "Deploy!"
[mediawiki-config] - https://gerrit.wikimedia.org/r/802833 (owner: Stang)
[07:09:41] (Merged) jenkins-bot: Revert "votewiki: Change wgLanguageCode to zh for May 2022 zhwiki admin election" [mediawiki-config] - https://gerrit.wikimedia.org/r/802833 (owner: Stang)
[07:10:06] (CR) Muehlenhoff: [C: +2] Record LDAP access for evza [puppet] - https://gerrit.wikimedia.org/r/803390 (https://phabricator.wikimedia.org/T309700) (owner: Muehlenhoff)
[07:10:51] koi: Is on mwdebug1001 if you want to test it.
[07:11:26] WMDE-Fisch: tested and LGTM
[07:11:40] koi: Cool. Going forward.
[07:12:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T310011)', diff saved to https://phabricator.wikimedia.org/P29462 and previous config saved to /var/cache/conftool/dbconfig/20220607-071207-marostegui.json
[07:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:13] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[07:12:26] (PS3) WMDE-Fisch: Enable tag on remaining FlaggedRevision page-stabilized wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/802732 (https://phabricator.wikimedia.org/T307348)
[07:14:01] (CR) Muehlenhoff: raid: Add SPDX headers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802564 (https://phabricator.wikimedia.org/T308013) (owner: Zabe)
[07:14:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:14:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:09] !log wmde-fisch@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:802833|Revert "votewiki: Change wgLanguageCode to zh for May 2022 zhwiki admin election"]] (duration: 03m 02s)
[07:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:20] koi: Done.
:-)
[07:15:32] thx!
[07:15:38] And I got my review. \o/
[07:15:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:15:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:15] (CR) WMDE-Fisch: [C: +2] "Deploy!" [mediawiki-config] - https://gerrit.wikimedia.org/r/802732 (https://phabricator.wikimedia.org/T307348) (owner: WMDE-Fisch)
[07:17:34] (PS1) Muehlenhoff: Add three further contributors [puppet] - https://gerrit.wikimedia.org/r/803391 (https://phabricator.wikimedia.org/T308013)
[07:18:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:18:21] (Merged) jenkins-bot: Enable tag on remaining FlaggedRevision page-stabilized wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/802732 (https://phabricator.wikimedia.org/T307348) (owner: WMDE-Fisch)
[07:22:10] Note to self: Tested on debug. Looks good. Deploying.
[07:22:27] (CR) Muehlenhoff: [C: +2] Add three further contributors [puppet] - https://gerrit.wikimedia.org/r/803391 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[07:23:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:07] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:24:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:39] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1053.eqiad.wmnet with OS bullseye
[07:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:24:43] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1053.eqiad.wmnet with OS bullseye
[07:25:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:36] !log wmde-fisch@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:802732|Enable tag on remaining FlaggedRevision page-stabilized wikis (T307348)]] (duration: 03m 19s)
[07:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:40] T307348: Enable Kartographer tag on remaining FlaggedRevision page-stabilized wikis - https://phabricator.wikimedia.org/T307348
[07:26:51] Morning backport window done!
[07:27:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P29463 and previous config saved to /var/cache/conftool/dbconfig/20220607-072712-marostegui.json
[07:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:50] (CR) Ayounsi: [C: +1] "LGTM but please run PCC on alert1001 and one of the servers with the `prometheus::pop` role (to make sure it doesn't break the blackbox ex" [puppet] - https://gerrit.wikimedia.org/r/802499 (https://phabricator.wikimedia.org/T304989) (owner: Cathal Mooney)
[07:35:17] (PS1) Slyngshede: C:query_service::deploy::autodeploy move to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/803393
[07:35:32] (Abandoned) Slyngshede: Unused manifest and script deleted as part of cronjob cleanup. [puppet] - https://gerrit.wikimedia.org/r/791376 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:36:25] (CR) CI reject: [V: -1] C:query_service::deploy::autodeploy move to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/803393 (owner: Slyngshede)
[07:37:14] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1053.eqiad.wmnet with reason: host reimage
[07:37:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:26] (PS2) Slyngshede: C:query_service::deploy::autodeploy move to systemd timer.
[puppet] - https://gerrit.wikimedia.org/r/803393
[07:39:49] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1053.eqiad.wmnet with reason: host reimage
[07:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:20] (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35748/console" [puppet] - https://gerrit.wikimedia.org/r/803393 (owner: Slyngshede)
[07:42:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P29464 and previous config saved to /var/cache/conftool/dbconfig/20220607-074218-marostegui.json
[07:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:53] !log upgrading ganeti/esams to Ganeti 3 T308238
[07:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:57] T308238: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238
[07:45:23] (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35749/console" [puppet] - https://gerrit.wikimedia.org/r/803393 (owner: Slyngshede)
[07:46:39] (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35750/console" [puppet] - https://gerrit.wikimedia.org/r/803393 (owner: Slyngshede)
[07:47:53] (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35751/console" [puppet] - https://gerrit.wikimedia.org/r/803393 (owner: Slyngshede)
[07:49:05] SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (MoritzMuehlenhoff)
[07:51:06] PROBLEM - Check systemd state on mwlog1002 is CRITICAL: CRITICAL
- degraded: The following units failed: logster-badpass_priv.service,logster-csp.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:51:20] (PS1) Aqu: Reference the pid file used by the scheduler.service [puppet] - https://gerrit.wikimedia.org/r/803396
[07:52:32] (CR) Slyngshede: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35753/console" [puppet] - https://gerrit.wikimedia.org/r/803393 (owner: Slyngshede)
[07:53:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1053.eqiad.wmnet with OS bullseye
[07:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:14] SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1053.eqiad.wmnet with OS bullseye completed: - ms-be1053 (**PASS**) - Downtim...
[07:55:06] (CR) Filippo Giunchedi: [C: +1] "LGTM!"
[software/spicerack] - https://gerrit.wikimedia.org/r/803317 (https://phabricator.wikimedia.org/T309447) (owner: Volans)
[07:55:06] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:57:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T310011)', diff saved to https://phabricator.wikimedia.org/P29465 and previous config saved to /var/cache/conftool/dbconfig/20220607-075723-marostegui.json
[07:57:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[07:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:27] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1096.eqiad.wmnet with reason: Maintenance
[07:57:28] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[07:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T310011)', diff saved to https://phabricator.wikimedia.org/P29466 and previous config saved to /var/cache/conftool/dbconfig/20220607-075731-marostegui.json
[07:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:42] (PS3) Slyngshede: C:query_service::deploy::autodeploy remove used autodeploy.
[puppet] - https://gerrit.wikimedia.org/r/803393
[08:08:39] ops-eqiad: Failed PSU on ganeti1023 - https://phabricator.wikimedia.org/T310041 (MoritzMuehlenhoff)
[08:08:52] ops-eqiad: Failed PSU on ganeti1023 - https://phabricator.wikimedia.org/T310041 (MoritzMuehlenhoff) p:Triage→Medium
[08:09:18] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:09:31] ACKNOWLEDGEMENT - IPMI Sensor Status on ganeti1023 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] Muehlenhoff T310041 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[08:09:52] PROBLEM - SSH on ms-be1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:10:59] (CR) Slyngshede: "It doesn't look like the auto deploy mode is being used."
[puppet] - https://gerrit.wikimedia.org/r/803393 (owner: Slyngshede)
[08:13:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T310011)', diff saved to https://phabricator.wikimedia.org/P29467 and previous config saved to /var/cache/conftool/dbconfig/20220607-081318-marostegui.json
[08:13:48] RECOVERY - SSH on ms-be1066 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:14:18] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:14:37] !log gnt-cluster upgrade --to 3.0 for ganeti/esams T308238
[08:15:54] (PS5) Slyngshede: P::aptrepo::wikimedia install Apache for private repo. [puppet] - https://gerrit.wikimedia.org/r/802445
[08:17:55] SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (MoritzMuehlenhoff)
[08:18:47] (PS2) Aqu: Reference the pid file used by the scheduler.service [puppet] - https://gerrit.wikimedia.org/r/803396
[08:19:49] FYI I'm merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/799870 which will roll-restart rsyslog
[08:20:08] (CR) Filippo Giunchedi: [V: +1 C: +2] rsyslog: bound disk-assisted queues [puppet] - https://gerrit.wikimedia.org/r/799870 (https://phabricator.wikimedia.org/T308439) (owner: Filippo Giunchedi)
[08:20:13] (PS2) Filippo Giunchedi: rsyslog: bound disk-assisted queues [puppet] - https://gerrit.wikimedia.org/r/799870 (https://phabricator.wikimedia.org/T308439)
[08:20:29] (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35754/console" [puppet] -
https://gerrit.wikimedia.org/r/802445 (owner: Slyngshede)
[08:20:53] (CR) Physikerwelt: "I don't have cluster access so I can only explain what needs to be done in theory." [deployment-charts] - https://gerrit.wikimedia.org/r/803305 (owner: PipelineBot)
[08:21:22] (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35755/console" [puppet] - https://gerrit.wikimedia.org/r/802823 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[08:21:43] (CR) CI reject: [V: -1] Reference the pid file used by the scheduler.service [puppet] - https://gerrit.wikimedia.org/r/803396 (owner: Aqu)
[08:24:34] (CR) Filippo Giunchedi: [C: +1] "Thank you for the to_yaml fix, I think this is basically good to be merged and tried out more widely (modulo comments e.g. dashboard)" [puppet] - https://gerrit.wikimedia.org/r/787067 (owner: Jbond)
[08:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:26:32] !log set on-disk max queue size for rsyslog fleet wide - T308439
[08:26:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:36] T308439: rsyslog disk spool files filled the filesystem on thanos-fe2001 - https://phabricator.wikimedia.org/T308439
[08:28:17] (PS3) Aqu: Reference the pid file used by the scheduler.service [puppet] - https://gerrit.wikimedia.org/r/803396
[08:28:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P29468 and previous config saved to /var/cache/conftool/dbconfig/20220607-082823-marostegui.json
[08:28:25] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:15] (CR) Slyngshede: [V: +1] P::aptrepo::wikimedia install Apache for private repo. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/802445 (owner: Slyngshede)
[08:29:54] !log drain ganeti3003 for reimage T308238
[08:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:58] T308238: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238
[08:31:06] (CR) CI reject: [V: -1] Reference the pid file used by the scheduler.service [puppet] - https://gerrit.wikimedia.org/r/803396 (owner: Aqu)
[08:32:51] (PS4) Aqu: Reference the pid file used by the scheduler.service [puppet] - https://gerrit.wikimedia.org/r/803396 (https://phabricator.wikimedia.org/T310042)
[08:34:26] (PS6) Ayounsi: Initial support for servers switch interfaces [cookbooks] - https://gerrit.wikimedia.org/r/803261
[08:39:44] (CR) Jelto: [V: +1 C: +2] gitlab::dump: delete role and profile classes [puppet] - https://gerrit.wikimedia.org/r/802823 (https://phabricator.wikimedia.org/T274463) (owner: Dzahn)
[08:40:03] (PS6) Slyngshede: P::aptrepo::wikimedia install Apache for private repo. [puppet] - https://gerrit.wikimedia.org/r/802445
[08:40:58] (CR) CI reject: [V: -1] P::aptrepo::wikimedia install Apache for private repo. [puppet] - https://gerrit.wikimedia.org/r/802445 (owner: Slyngshede)
[08:42:22] (CR) Volans: "reply inline" [puppet] - https://gerrit.wikimedia.org/r/803340 (owner: Nskaggs)
[08:42:51] (PS7) Slyngshede: P::aptrepo::wikimedia install Apache for private repo.
[puppet] - https://gerrit.wikimedia.org/r/802445
[08:43:12] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27
[08:43:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P29469 and previous config saved to /var/cache/conftool/dbconfig/20220607-084328-marostegui.json
[08:43:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:46] (CR) CI reject: [V: -1] P::aptrepo::wikimedia install Apache for private repo. [puppet] - https://gerrit.wikimedia.org/r/802445 (owner: Slyngshede)
[08:44:51] (PS8) Slyngshede: P::aptrepo::wikimedia install Apache for private repo. [puppet] - https://gerrit.wikimedia.org/r/802445
[08:48:58] (CR) David Caro: [C: +2] ceph: fix regex to match dbg/dbgsym [puppet] - https://gerrit.wikimedia.org/r/802736 (https://phabricator.wikimedia.org/T309786) (owner: David Caro)
[08:52:20] (CR) Slyngshede: P::aptrepo::wikimedia install Apache for private repo.
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/802445 (owner: Slyngshede)
[08:57:21] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/802445 (owner: Slyngshede)
[08:58:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T310011)', diff saved to https://phabricator.wikimedia.org/P29470 and previous config saved to /var/cache/conftool/dbconfig/20220607-085833-marostegui.json
[08:58:35] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[08:58:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:37] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[08:58:37] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[08:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:11] (CR) Muehlenhoff: [C: +1] "Looks good. We can also test this after merging by e.g.
depooling both codfw replicas (which see very, very little traffic)" [puppet] - https://gerrit.wikimedia.org/r/802071 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[08:59:34] (PS1) Jbond: netbox: Add SANs for addtional vhosts [puppet] - https://gerrit.wikimedia.org/r/803456 (https://phabricator.wikimedia.org/T296452)
[09:00:12] SRE, SRE-Access-Requests: Requesting access to ores-admins for ml-team-admins - https://phabricator.wikimedia.org/T310044 (elukey)
[09:00:27] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35756/console" [puppet] - https://gerrit.wikimedia.org/r/803456 (https://phabricator.wikimedia.org/T296452) (owner: Jbond)
[09:00:29] (CR) Slyngshede: [C: +2] P::aptrepo::wikimedia install Apache for private repo. [puppet] - https://gerrit.wikimedia.org/r/802445 (owner: Slyngshede)
[09:00:53] (CR) Muehlenhoff: sre: update renamed otrs role to vrts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802579 (https://phabricator.wikimedia.org/T293942) (owner: Dzahn)
[09:01:44] SRE, SRE-Access-Requests: Requesting access to ores-admin for ml-team-admins - https://phabricator.wikimedia.org/T310044 (elukey)
[09:02:50] (PS1) Elukey: admin: add ml-team-admins to ores-admin by default [puppet] - https://gerrit.wikimedia.org/r/803457 (https://phabricator.wikimedia.org/T310044)
[09:03:03] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35757/console" [puppet] - https://gerrit.wikimedia.org/r/803456 (https://phabricator.wikimedia.org/T296452) (owner: Jbond)
[09:03:48] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to ores-admin for ml-team-admins - https://phabricator.wikimedia.org/T310044 (elukey) @calbon could you please review and approve?
:)
[09:03:52] (CR) CI reject: [V: -1] admin: add ml-team-admins to ores-admin by default [puppet] - https://gerrit.wikimedia.org/r/803457 (https://phabricator.wikimedia.org/T310044) (owner: Elukey)
[09:05:14] (PS1) Jbond: netbox: switch netbox to new infrastructure [dns] - https://gerrit.wikimedia.org/r/803459 (https://phabricator.wikimedia.org/T296452)
[09:05:16] (PS1) Jbond: netbox: increase TTL to 1D [dns] - https://gerrit.wikimedia.org/r/803460 (https://phabricator.wikimedia.org/T296452)
[09:06:01] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35758/console" [puppet] - https://gerrit.wikimedia.org/r/803456 (https://phabricator.wikimedia.org/T296452) (owner: Jbond)
[09:06:56] (PS2) Elukey: admin: add ml-team-admins to ores-admin by default [puppet] - https://gerrit.wikimedia.org/r/803457 (https://phabricator.wikimedia.org/T310044)
[09:07:15] (PS2) Jbond: netbox: switch netbox to new infrastructure [dns] - https://gerrit.wikimedia.org/r/803459 (https://phabricator.wikimedia.org/T296452)
[09:07:46] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance
[09:07:48] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2129.codfw.wmnet with reason: Maintenance
[09:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 8 hosts with reason: Maintenance
[09:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:07:55] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 8 hosts with reason: Maintenance
[09:07:57] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:27] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid test cluster: Roll restart of Druid jvm daemons.
[09:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:42] (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35759/console" [puppet] - https://gerrit.wikimedia.org/r/803396 (https://phabricator.wikimedia.org/T310042) (owner: Aqu)
[09:10:12] (PS3) Jbond: netbox: switch netbox to new infrastructure [dns] - https://gerrit.wikimedia.org/r/803459 (https://phabricator.wikimedia.org/T296452)
[09:10:14] (PS1) Jbond: netbox: decrease TTL to 5m for fail over [dns] - https://gerrit.wikimedia.org/r/803462 (https://phabricator.wikimedia.org/T296452)
[09:11:42] (PS2) Jbond: netbox: increase TTL to 1D [dns] - https://gerrit.wikimedia.org/r/803460 (https://phabricator.wikimedia.org/T296452)
[09:11:56] (CR) Jbond: [C: +2] netbox: decrease TTL to 5m for fail over [dns] - https://gerrit.wikimedia.org/r/803462 (https://phabricator.wikimedia.org/T296452) (owner: Jbond)
[09:12:48] (CR) Elukey: "The only diff in permissions for this code review are related to Aiko, since Chris and Kevin are already in ores-admin." [puppet] - https://gerrit.wikimedia.org/r/803457 (https://phabricator.wikimedia.org/T310044) (owner: Elukey)
[09:13:36] (PS4) Jbond: netbox: switch netbox to new infrastructure [dns] - https://gerrit.wikimedia.org/r/803459 (https://phabricator.wikimedia.org/T296452)
[09:13:49] (PS3) Jbond: netbox: increase TTL to 1D [dns] - https://gerrit.wikimedia.org/r/803460 (https://phabricator.wikimedia.org/T296452)
[09:15:35] (PS1) Slyngshede: profile::aptrepo::wikimedia Fix missing vhost variables.
[puppet] - https://gerrit.wikimedia.org/r/803463
[09:16:04] (PS2) Jbond: netbox: Add SANs for addtional vhosts [puppet] - https://gerrit.wikimedia.org/r/803456 (https://phabricator.wikimedia.org/T296452)
[09:17:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[09:17:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[09:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T310011)', diff saved to https://phabricator.wikimedia.org/P29471 and previous config saved to /var/cache/conftool/dbconfig/20220607-091716-marostegui.json
[09:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:21] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[09:17:37] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid test cluster: Roll restart of Druid jvm daemons.
[09:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:13] (PS2) Slyngshede: profile::aptrepo::wikimedia Fix missing vhost variables.
[puppet] - https://gerrit.wikimedia.org/r/803463
[09:18:49] (CR) Btullis: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35761/console" [puppet] - https://gerrit.wikimedia.org/r/803396 (https://phabricator.wikimedia.org/T310042) (owner: Aqu)
[09:19:14] (CR) CI reject: [V: -1] netbox: Add SANs for addtional vhosts [puppet] - https://gerrit.wikimedia.org/r/803456 (https://phabricator.wikimedia.org/T296452) (owner: Jbond)
[09:19:42] (CR) Lucas Werkmeister (WMDE): [C: +1] "In theory we could do this only for the Commons dumps, but it’s probably less confusing if we just do it for all dumps together." [puppet] - https://gerrit.wikimedia.org/r/802921 (https://phabricator.wikimedia.org/T301104) (owner: Mitar)
[09:21:40] (CR) Volans: "Did a first pass as requested" [cookbooks] - https://gerrit.wikimedia.org/r/803261 (owner: Ayounsi)
[09:26:44] (PS3) Jbond: netbox: Add SANs for addtional vhosts [puppet] - https://gerrit.wikimedia.org/r/803456 (https://phabricator.wikimedia.org/T296452)
[09:26:48] (PS4) Muehlenhoff: arclamp: add rsync config to migrate Xenon data [puppet] - https://gerrit.wikimedia.org/r/802752 (https://phabricator.wikimedia.org/T305460)
[09:27:29] (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35762/console" [puppet] - https://gerrit.wikimedia.org/r/803456 (https://phabricator.wikimedia.org/T296452) (owner: Jbond)
[09:29:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T310011)', diff saved to https://phabricator.wikimedia.org/P29472 and previous config saved to /var/cache/conftool/dbconfig/20220607-092953-marostegui.json
[09:29:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:58] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis
- https://phabricator.wikimedia.org/T310011
[09:32:20] (CR) Jbond: utils: Add small script to set up bundler (1 comment) [puppet] - https://gerrit.wikimedia.org/r/803341 (owner: Jbond)
[09:34:32] PROBLEM - ElasticSearch unassigned shard check - 9200 on cloudelastic1004 is CRITICAL: CRITICAL - commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z), commonswiki_file_1647920262[11](2022-05-31T16:52:02.429Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:36:18] (PS3) Slyngshede: profile::aptrepo::wikimedia Fix missing vhost variables. [puppet] - https://gerrit.wikimedia.org/r/803463
[09:38:13] (CR) Btullis: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35764/console" [puppet] - https://gerrit.wikimedia.org/r/803396 (https://phabricator.wikimedia.org/T310042) (owner: Aqu)
[09:40:41] (CR) Jbond: [V: +1 C: +2] netbox: Add SANs for addtional vhosts [puppet] - https://gerrit.wikimedia.org/r/803456 (https://phabricator.wikimedia.org/T296452) (owner: Jbond)
[09:41:22] (PS4) Slyngshede: profile::aptrepo::wikimedia Fix missing vhost variables. [puppet] - https://gerrit.wikimedia.org/r/803463
[09:44:32] (CR) Mitar: Add page metadata to Wikibase JSON dumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/802921 (https://phabricator.wikimedia.org/T301104) (owner: Mitar)
[09:44:38] (PS1) Jbond: netbox: update netbox to use internal discovery address [software/spicerack] - https://gerrit.wikimedia.org/r/803465 (https://phabricator.wikimedia.org/T296452)
[09:44:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P29473 and previous config saved to /var/cache/conftool/dbconfig/20220607-094458-marostegui.json
[09:45:01] (PS5) Slyngshede: profile::aptrepo::wikimedia Fix missing vhost variables.
[puppet] - https://gerrit.wikimedia.org/r/803463
[09:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:55] (PS1) Jbond: sre.dns.netbox: update to use new netbox servers [cookbooks] - https://gerrit.wikimedia.org/r/803466 (https://phabricator.wikimedia.org/T296452)
[09:48:25] (PS2) Jbond: netbox: update netbox to use internal discovery address [software/spicerack] - https://gerrit.wikimedia.org/r/803465 (https://phabricator.wikimedia.org/T296452)
[09:48:51] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/803463 (owner: Slyngshede)
[09:49:25] (CR) Slyngshede: [C: +2] profile::aptrepo::wikimedia Fix missing vhost variables. [puppet] - https://gerrit.wikimedia.org/r/803463 (owner: Slyngshede)
[09:50:41] (CR) Btullis: [V: +1] "Something seems odd about this to me, because you mention that the pidfile is missing on the "Airflow Analytics" instance, which runs on a" [puppet] - https://gerrit.wikimedia.org/r/803396 (https://phabricator.wikimedia.org/T310042) (owner: Aqu)
[09:51:19] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons.
[09:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:21] (CR) Filippo Giunchedi: [C: +2] hieradata: TCP probe for ldap-ro [puppet] - https://gerrit.wikimedia.org/r/802071 (https://phabricator.wikimedia.org/T305847) (owner: Filippo Giunchedi)
[09:55:05] (PS1) Jbond: netbox: move api url to discovery domain name [puppet] - https://gerrit.wikimedia.org/r/803468 (https://phabricator.wikimedia.org/T296452)
[09:56:14] (PS2) Jbond: netbox: move api url to discovery domain name [puppet] - https://gerrit.wikimedia.org/r/803468 (https://phabricator.wikimedia.org/T296452)
[09:58:11] (CR) CI reject: [V: -1] netbox: update netbox to use internal discovery address [software/spicerack] - https://gerrit.wikimedia.org/r/803465 (https://phabricator.wikimedia.org/T296452) (owner: Jbond)
[09:58:13] (CR) Muehlenhoff: [C: +2] arclamp: add rsync config to migrate Xenon data [puppet] - https://gerrit.wikimedia.org/r/802752 (https://phabricator.wikimedia.org/T305460) (owner: Muehlenhoff)
[09:59:17] (CR) Jbond: [C: +1] Netbox Ganeti sync: add groups support (1 comment) [software/netbox-extras] - https://gerrit.wikimedia.org/r/802179 (https://phabricator.wikimedia.org/T262446) (owner: Volans)
[09:59:30] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on ganeti3003.esams.wmnet with reason: Remove from cluster for firmware update and eventual reimage
[09:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on ganeti3003.esams.wmnet with reason: Remove from cluster for firmware update and eventual reimage
[09:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P29474
and previous config saved to /var/cache/conftool/dbconfig/20220607-100003-marostegui.json
[10:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:16] (CR) Jbond: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/802178 (owner: Volans)
[10:00:27] moritzm: ok probes for ldap are deployed now, ldap-ro is fine but the ldap-ro-ssl is failing due to the certificate not having any SAN -.- (i've preemptively silenced the alert though)
[10:07:53] SRE, Infrastructure-Foundations: Upgrade ganeti/esams to Bullseye - https://phabricator.wikimedia.org/T308238 (MoritzMuehlenhoff) a:MoritzMuehlenhoff→RobH ganeti3003 is removed from the cluster, downtimed and needs the same firmware/NIC updates as ganeti4*/ganeti5* to enable the reimage to Bullseye.
[10:08:56] (CR) Volans: [C: -1] "It needs some tweaking but the approach is ok." [cookbooks] - https://gerrit.wikimedia.org/r/803466 (https://phabricator.wikimedia.org/T296452) (owner: Jbond)
[10:09:46] SRE-Access-Requests, LDAP-Access-Requests: Check access rights for GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (Tobi_WMDE_SW)
[10:10:21] (PS4) Lucas Werkmeister (WMDE): httpbb: Add basic tests for query_service (WDQS) [puppet] - https://gerrit.wikimedia.org/r/802079
[10:10:23] (PS5) Lucas Werkmeister (WMDE): query_service: don’t cache index files [puppet] - https://gerrit.wikimedia.org/r/799297 (https://phabricator.wikimedia.org/T289243)
[10:10:25] (CR) Lucas Werkmeister (WMDE): httpbb: Add basic tests for query_service (WDQS) (3 comments) [puppet] - https://gerrit.wikimedia.org/r/802079 (owner: Lucas Werkmeister (WMDE))
[10:10:28] (CR) Jbond: prometheus::blackbox::check: add new blackbox exporter check (4 comments) [puppet] - https://gerrit.wikimedia.org/r/787067 (owner: Jbond)
[10:11:10] (CR) Volans: [C: -1] "The logic need tweaking" [software/spicerack] -
https://gerrit.wikimedia.org/r/803465 (https://phabricator.wikimedia.org/T296452) (owner: Jbond)
[10:13:53] (CR) Volans: [C: -1] "additional note" [cookbooks] - https://gerrit.wikimedia.org/r/803466 (https://phabricator.wikimedia.org/T296452) (owner: Jbond)
[10:14:59] SRE-Access-Requests, LDAP-Access-Requests: Check access rights for GoranSMilovanovic - https://phabricator.wikimedia.org/T310055 (MoritzMuehlenhoff) Adding @KFrancis for comments; does this need an updated NDA, e.g. signing a volunteer NDA which replaces the former NDA when @GoranSMilovanovic was employe...
[10:15:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T310011)', diff saved to https://phabricator.wikimedia.org/P29475 and previous config saved to /var/cache/conftool/dbconfig/20220607-101508-marostegui.json
[10:15:10] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[10:15:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1113.eqiad.wmnet with reason: Maintenance
[10:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:15] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[10:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T310011)', diff saved to https://phabricator.wikimedia.org/P29476 and previous config saved to /var/cache/conftool/dbconfig/20220607-101516-marostegui.json
[10:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:55] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[10:20:36] (PS1) Milimetric: role::common::aqs: Update mediawiki history source of aqs [puppet] - https://gerrit.wikimedia.org/r/803472
[10:20:39] (PS2) Muehlenhoff: Failover idp.w.o to idp1002 (new Bullseye node) [dns] - https://gerrit.wikimedia.org/r/802541 (https://phabricator.wikimedia.org/T308214)
[10:20:54] (PS2) Muehlenhoff: Failover active IDP nodes to idp1002/idp2002 [puppet] - https://gerrit.wikimedia.org/r/802542 (https://phabricator.wikimedia.org/T308214)
[10:24:23] (CR) Btullis: [C: +2] role::common::aqs: Update mediawiki history source of aqs [puppet] - https://gerrit.wikimedia.org/r/803472 (owner: Milimetric)
[10:25:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T310011)', diff saved to https://phabricator.wikimedia.org/P29477 and previous config saved to /var/cache/conftool/dbconfig/20220607-102559-marostegui.json
[10:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:04] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011
[10:28:05] (PS1) Jbond: utils: Add hooks [puppet] - https://gerrit.wikimedia.org/r/803474
[10:28:27] (PS2) Jbond: utils: Add hooks [puppet] - https://gerrit.wikimedia.org/r/803474
[10:30:09] (PS2) Jbond: utils: Add small script to set up bundler [puppet] - https://gerrit.wikimedia.org/r/803341
[10:30:20] RECOVERY - mediawiki originals uploads -hourly- for codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[10:32:27] !log btullis@cumin1001 START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons.
[10:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:38] (03PS2) 10Jbond: sre.dns.netbox: update to use new netbox servers [cookbooks] - 10https://gerrit.wikimedia.org/r/803466 (https://phabricator.wikimedia.org/T296452) [10:36:44] (03CR) 10Jbond: "updated thanks" [cookbooks] - 10https://gerrit.wikimedia.org/r/803466 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [10:37:28] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host ms-be1054.eqiad.wmnet with OS bullseye [10:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:33] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host ms-be1054.eqiad.wmnet with OS bullseye [10:38:35] !log btullis@cumin1001 END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [10:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:54] (03CR) 10Jbond: [C: 03+1] "LGTM assuming calbon approves" [puppet] - 10https://gerrit.wikimedia.org/r/803457 (https://phabricator.wikimedia.org/T310044) (owner: 10Elukey) [10:40:19] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. 
[10:40:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P29478 and previous config saved to /var/cache/conftool/dbconfig/20220607-104104-marostegui.json [10:41:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:31] (03CR) 10Dom Walden: mathoid: pipeline bot promote (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/803305 (owner: 10PipelineBot) [10:51:24] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid public cluster: Roll restart of Druid jvm daemons. [10:51:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:59] 10SRE, 10Release-Engineering-Team, 10Scap, 10serviceops: Deploy Scap version 4.8.2 - https://phabricator.wikimedia.org/T309116 (10JMeybohm) 05Open→03Resolved 4.8.2 deployed fleet wide [10:54:25] (03PS1) 10Slyngshede: profile::aptrepo::wikimedia Enable private apt repo. 
[puppet] - 10https://gerrit.wikimedia.org/r/803480 [10:55:54] 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10MoritzMuehlenhoff) [10:56:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P29479 and previous config saved to /var/cache/conftool/dbconfig/20220607-105609-marostegui.json [10:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:23] jouncebot: next [10:56:23] In 2 hour(s) and 3 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220607T1300) [10:56:35] 10SRE, 10Performance-Team, 10Patch-For-Review: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 (10MoritzMuehlenhoff) [10:59:18] (03PS2) 10JMeybohm: mediawiki: stop revalidating opcache on canaries [puppet] - 10https://gerrit.wikimedia.org/r/802134 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [10:59:24] (03CR) 10Ayounsi: [C: 03+1] netbox: decrease TTL to 5m for fail over [dns] - 10https://gerrit.wikimedia.org/r/803462 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:00:11] (03CR) 10Ayounsi: [C: 03+1] netbox: switch netbox to new infrastructure [dns] - 10https://gerrit.wikimedia.org/r/803459 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:01:20] (03PS2) 10Slyngshede: profile::aptrepo::wikimedia Enable private apt repo. 
[puppet] - 10https://gerrit.wikimedia.org/r/803480 [11:01:42] (03CR) 10Ayounsi: [C: 03+1] netbox: increase TTL to 1D [dns] - 10https://gerrit.wikimedia.org/r/803460 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:01:46] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1054.eqiad.wmnet with reason: host reimage [11:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:16] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35766/console" [puppet] - 10https://gerrit.wikimedia.org/r/803480 (owner: 10Slyngshede) [11:02:47] (03CR) 10Ayounsi: [C: 03+1] netbox: Add SANs for addtional vhosts [puppet] - 10https://gerrit.wikimedia.org/r/803456 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:03:34] (03CR) 10JMeybohm: [C: 03+2] mediawiki: stop revalidating opcache on canaries [puppet] - 10https://gerrit.wikimedia.org/r/802134 (https://phabricator.wikimedia.org/T266055) (owner: 10Giuseppe Lavagetto) [11:04:15] (03PS1) 10MVernon: sre.hosts.reimage: grammar tweak [cookbooks] - 10https://gerrit.wikimedia.org/r/803482 [11:04:33] (03PS1) 10Btullis: Bump the eventgate image used by eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/803483 (https://phabricator.wikimedia.org/T306181) [11:04:47] !log restarting php-fpm on api and appserver canaries - T266055 [11:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:50] T266055: Update Scap to perform rolling restart for all MW deploy - https://phabricator.wikimedia.org/T266055 [11:04:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1054.eqiad.wmnet with reason: host reimage [11:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:29] (03CR) 10Volans: [C: 03+2] "Thanks!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/803482 (owner: 10MVernon) [11:09:32] (03Merged) 10jenkins-bot: sre.hosts.reimage: grammar tweak [cookbooks] - 10https://gerrit.wikimedia.org/r/803482 (owner: 10MVernon) [11:09:53] (03PS3) 10Jbond: netbox: update netbox to use internal discovery address [software/spicerack] - 10https://gerrit.wikimedia.org/r/803465 (https://phabricator.wikimedia.org/T296452) [11:10:36] (03CR) 10JMeybohm: [C: 03+1] service: image-suggestion state to monitoring_setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799358 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [11:10:42] (03CR) 10Jbond: [C: 03+2] utils: Add hooks [puppet] - 10https://gerrit.wikimedia.org/r/803474 (owner: 10Jbond) [11:11:03] (03CR) 10Ayounsi: "profile::netbox::discovery_name have different values depending on the hiera groups:" [puppet] - 10https://gerrit.wikimedia.org/r/803468 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:11:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T310011)', diff saved to https://phabricator.wikimedia.org/P29480 and previous config saved to /var/cache/conftool/dbconfig/20220607-111114-marostegui.json [11:11:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1131.eqiad.wmnet with reason: Maintenance [11:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1131.eqiad.wmnet with reason: Maintenance [11:11:18] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [11:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:23] !log marostegui@cumin1001 dbctl commit (dc=all): 
'Depooling db1131 (T310011)', diff saved to https://phabricator.wikimedia.org/P29481 and previous config saved to /var/cache/conftool/dbconfig/20220607-111122-marostegui.json [11:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:05] (03CR) 10Btullis: [C: 03+2] Bump the eventgate image used by eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/803483 (https://phabricator.wikimedia.org/T306181) (owner: 10Btullis) [11:15:56] (03PS1) 10Ayounsi: Enable Icinga notifications on new netbox frontends [puppet] - 10https://gerrit.wikimedia.org/r/803485 (https://phabricator.wikimedia.org/T296452) [11:16:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T310011)', diff saved to https://phabricator.wikimedia.org/P29482 and previous config saved to /var/cache/conftool/dbconfig/20220607-111657-marostegui.json [11:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:00] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [11:18:41] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:43] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1054.eqiad.wmnet with OS bullseye [11:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:45] (03Merged) 10jenkins-bot: Bump the eventgate image used by eventgate-analytics-external [deployment-charts] - 10https://gerrit.wikimedia.org/r/803483 (https://phabricator.wikimedia.org/T306181) (owner: 10Btullis) [11:18:47] 10SRE-swift-storage: Upgrade Swift ms cluster to Bullseye and revisit mkfs.xfs options - https://phabricator.wikimedia.org/T279637 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host ms-be1054.eqiad.wmnet with OS bullseye completed: - ms-be1054 (**PASS**) - Downtim... [11:20:15] (03CR) 10CI reject: [V: 04-1] netbox: update netbox to use internal discovery address [software/spicerack] - 10https://gerrit.wikimedia.org/r/803465 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:25:57] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [11:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:28] (03CR) 10JMeybohm: [C: 04-1] Run isort/black on the codebase (031 comment) [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801642 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [11:26:36] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [11:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:09] (03CR) 10JMeybohm: [C: 03+1] tox: add formattercheck [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801643 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [11:27:31] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [11:27:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:39] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [11:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:46] !log hnowlan@cumin1001 START - Cookbook sre.dns.netbox [11:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:03] (03CR) 10Jbond: [C: 03+1] Failover idp.w.o to idp1002 (new Bullseye node) [dns] - 10https://gerrit.wikimedia.org/r/802541 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [11:30:16] !log btullis@deploy1002 
helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [11:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:38] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/802542 (https://phabricator.wikimedia.org/T308214) (owner: 10Muehlenhoff) [11:31:21] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [11:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:26] (03CR) 10JMeybohm: "As this will change metric names: Are there any dependent changes to alerts/dashboards etc. that need to be taken care of?" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801644 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [11:32:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P29483 and previous config saved to /var/cache/conftool/dbconfig/20220607-113202-marostegui.json [11:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:06] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:52] (03CR) 10Ayounsi: "Thanks, replies inline." [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [11:35:07] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons. [11:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:31] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid public cluster: Roll restart of Druid jvm daemons. 
[11:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:45] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:47:01] (03PS1) 10Jbond: netbox: move config for new serveres to role [puppet] - 10https://gerrit.wikimedia.org/r/803488 (https://phabricator.wikimedia.org/T296452) [11:47:03] (03PS7) 10Ayounsi: Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 [11:47:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P29484 and previous config saved to /var/cache/conftool/dbconfig/20220607-114707-marostegui.json [11:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:18] 10SRE, 10serviceops: Migrate Zookeeper/etcd conf cluster in codfw to Buster - https://phabricator.wikimedia.org/T224560 (10JMeybohm) 05Open→03Resolved a:03JMeybohm conf2* servers are on buster as of the not perfectly named ticket T271573 [11:47:20] 10SRE: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10JMeybohm) [11:48:04] !log btullis@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. 
[11:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:37] (03PS2) 10Jbond: netbox: move config for new serveres to role [puppet] - 10https://gerrit.wikimedia.org/r/803488 (https://phabricator.wikimedia.org/T296452) [11:51:03] (03CR) 10CI reject: [V: 04-1] Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [11:51:54] 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to buster - https://phabricator.wikimedia.org/T271573 (10JMeybohm) [11:52:01] 10SRE, 10serviceops: Update conf1* servers - https://phabricator.wikimedia.org/T310062 (10JMeybohm) [11:52:15] (03PS1) 10Jbond: hieradata: decommission netbox servers [puppet] - 10https://gerrit.wikimedia.org/r/803489 (https://phabricator.wikimedia.org/T296452) [11:52:18] (03CR) 10Jbond: netbox: move api url to discovery domain name (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803468 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:52:30] (03CR) 10JMeybohm: [C: 04-2] "As discussed on IRC: There is no python-twisted for 3.5 on stretch, so we would need to update the conf1* cluster first - https://phabrica" [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801646 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [11:52:45] (03PS2) 10JMeybohm: Port to Python 3.5 [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801646 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [11:53:14] (03CR) 10Ayounsi: [C: 04-1] netbox: move config for new serveres to role (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/803488 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:53:46] (03CR) 10Muehlenhoff: "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/802897 (owner: 10Ladsgroup) [11:54:10] !log btullis@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-public cluster: Roll restart of jvm daemons. [11:54:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:35] !log hnowlan@cumin1001 START - Cookbook sre.dns.netbox [11:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:45] !log btullis@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. [11:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:01] (03PS3) 10Jbond: netbox: move config for new serveres to role [puppet] - 10https://gerrit.wikimedia.org/r/803488 (https://phabricator.wikimedia.org/T296452) [11:57:47] (03PS3) 10Ayounsi: Add python3.10 support to Tox [cookbooks] - 10https://gerrit.wikimedia.org/r/803263 [11:57:49] (03PS8) 10Ayounsi: Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 [11:57:49] PROBLEM - SSH on cp5012.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:58:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 3 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35768/console" [puppet] - 10https://gerrit.wikimedia.org/r/803488 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [11:58:49] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:04] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! 
Merging" [puppet] - 10https://gerrit.wikimedia.org/r/802564 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [11:59:17] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:02:05] (03CR) 10Klausman: [C: 03+1] admin: add ml-team-admins to ores-admin by default [puppet] - 10https://gerrit.wikimedia.org/r/803457 (https://phabricator.wikimedia.org/T310044) (owner: 10Elukey) [12:02:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T310011)', diff saved to https://phabricator.wikimedia.org/P29485 and previous config saved to /var/cache/conftool/dbconfig/20220607-120212-marostegui.json [12:02:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance [12:02:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1140.eqiad.wmnet with reason: Maintenance [12:02:16] T310011: Adjust the field type of cu_changes.cuc_timestamp to fixed binary and remove default on wmf wikis - https://phabricator.wikimedia.org/T310011 [12:02:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:46] (03PS2) 10Hnowlan: service: image-suggestion state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/799357 (https://phabricator.wikimedia.org/T304891) [12:02:51] !log btullis@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-druid-analytics cluster: Roll restart of jvm daemons. 
[12:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:58] (03CR) 10Muehlenhoff: [C: 03+2] "The missing contributor has now opted in via https://phabricator.wikimedia.org/T308013, so merging." [puppet] - 10https://gerrit.wikimedia.org/r/792608 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [12:03:15] (03CR) 10Jbond: [V: 03+1] "updated thanks" [puppet] - 10https://gerrit.wikimedia.org/r/803488 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [12:03:30] (03PS4) 10Jbond: netbox: move config for new serveres to role [puppet] - 10https://gerrit.wikimedia.org/r/803488 (https://phabricator.wikimedia.org/T296452) [12:03:48] (03PS2) 10Btullis: Use latest image version in all remaining eventgate services [deployment-charts] - 10https://gerrit.wikimedia.org/r/803242 (https://phabricator.wikimedia.org/T306181) [12:04:52] (03CR) 10Jbond: [C: 03+1] "LGTM, FYI i also do this as part of https://gerrit.wikimedia.org/r/c/operations/puppet/+/803488 but its also fine to do this one first" [puppet] - 10https://gerrit.wikimedia.org/r/803485 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [12:06:09] (03PS2) 10Muehlenhoff: gdnsd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/799307 (https://phabricator.wikimedia.org/T308013) [12:06:44] !log btullis@cumin1001 START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. [12:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:27] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [12:11:59] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35770/console" [puppet] - 10https://gerrit.wikimedia.org/r/799357 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [12:12:18] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] service: image-suggestion state to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/799357 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [12:12:51] !log btullis@cumin1001 END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. [12:12:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:56] (03PS3) 10Hnowlan: service: image-suggestion state to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/799358 (https://phabricator.wikimedia.org/T304891) [12:19:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/803480 (owner: 10Slyngshede) [12:21:40] (03CR) 10Hnowlan: [C: 03+2] service: image-suggestion state to monitoring_setup [puppet] - 10https://gerrit.wikimedia.org/r/799358 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [12:22:40] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/803457 (https://phabricator.wikimedia.org/T310044) (owner: 10Elukey) [12:26:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:26:41] Hi! 
We're going to do some maintenance on Netbox in the next few minutes; we're planning for 30 min, but it should be less. Please don't write to Netbox for the time being, either directly or through cookbooks. [12:27:56] (03PS2) 10Hnowlan: service: image-suggestion state to production [puppet] - 10https://gerrit.wikimedia.org/r/799998 (https://phabricator.wikimedia.org/T304891) [12:28:33] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.05553 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [12:29:13] (03PS1) 10Muehlenhoff: Trim comment [puppet] - 10https://gerrit.wikimedia.org/r/803490 [12:30:45] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] profile::aptrepo::wikimedia Enable private apt repo. [puppet] - 10https://gerrit.wikimedia.org/r/803480 (owner: 10Slyngshede) [12:36:26] (03Abandoned) 10Ayounsi: Add netbox geodns entries. [dns] - 10https://gerrit.wikimedia.org/r/541602 (https://phabricator.wikimedia.org/T234997) (owner: 10CRusnov) [12:40:06] (03PS1) 10Lucas Werkmeister (WMDE): Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803494 (https://phabricator.wikimedia.org/T304328) [12:40:08] (03PS1) 10Lucas Werkmeister (WMDE): Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803495 (https://phabricator.wikimedia.org/T304328) [12:40:10] (03PS1) 10Lucas Werkmeister (WMDE): Rename wmgWikibaseUseSSRTermbox to wmgWikibaseTermboxEnabled (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803496 (https://phabricator.wikimedia.org/T304328) [12:40:12] (03PS1) 10Lucas Werkmeister (WMDE): Separate wmgWikibaseTermboxEnabled and wmgWikibaseSSRTermboxServerUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803497 (https://phabricator.wikimedia.org/T304328) [12:40:14] (03PS1) 10Lucas Werkmeister (WMDE): Unconfigure
wmgWikibaseSSRTermboxServerUrl on Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803498 (https://phabricator.wikimedia.org/T304328) [12:41:02] (03CR) 10Ayounsi: [C: 03+2] netbox: switch netbox to new infrastructure [dns] - 10https://gerrit.wikimedia.org/r/803459 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [12:41:44] (03CR) 10Lucas Werkmeister (WMDE): Turn Wikbase termbox SSR off for beta wikidata (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802770 (https://phabricator.wikimedia.org/T304328) (owner: 10Itamar Givon) [12:43:05] (03PS1) 10Jbond: Revert "service: image-suggestion state to monitoring_setup" [puppet] - 10https://gerrit.wikimedia.org/r/802963 [12:43:30] hnowlan: I'm reverting your last change; monitoring_setup is no longer a valid state. I think godog pushed the CR to change this this morning, will find it in a sec [12:43:37] but it's currently causing mass puppet failures [12:43:50] (as a follow-up, something in CI should load and validate the service catalog) [12:44:04] (03CR) 10Jbond: [C: 03+2] Revert "service: image-suggestion state to monitoring_setup" [puppet] - 10https://gerrit.wikimedia.org/r/802963 (owner: 10Jbond) [12:44:49] (03PS1) 10Kevin Bazira: ml-services: add ukwiki & wikidatawiki articlequality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/803499 (https://phabricator.wikimedia.org/T307418) [12:45:03] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [12:45:43] (03PS1) 10Jbond: service: image-suggestion state to production with paging disabled [puppet] - 10https://gerrit.wikimedia.org/r/802964 (https://phabricator.wikimedia.org/T304891) [12:46:05] (03PS2) 10Jbond: service: image-suggestion state to production with paging disabled [puppet] - 10https://gerrit.wikimedia.org/r/802964 (https://phabricator.wikimedia.org/T304891) [12:46:34] thanks jbond [12:46:35] (03CR) 10Jbond: "had to revert this as
monitoring_setup was recently dropped, see: https://gerrit.wikimedia.org/r/c/operations/puppet/+/803231" [puppet] - 10https://gerrit.wikimedia.org/r/799358 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [12:46:49] I'll send a followup email to ops@ [12:47:01] (03CR) 10Jbond: "See the following for recent changes" [puppet] - 10https://gerrit.wikimedia.org/r/802964 (https://phabricator.wikimedia.org/T304891) (owner: 10Jbond) [12:47:20] godog: ack. hnowlan, in the meantime I think https://gerrit.wikimedia.org/r/c/operations/puppet/+/802964 is what you need [12:47:32] jouncebot: next [12:47:32] In 0 hour(s) and 12 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220607T1300) [12:47:42] (03CR) 10Filippo Giunchedi: [C: 03+1] service: image-suggestion state to production with paging disabled [puppet] - 10https://gerrit.wikimedia.org/r/802964 (https://phabricator.wikimedia.org/T304891) (owner: 10Jbond) [12:47:50] * jbond running puppet on failed hosts [12:47:51] would it be okay if i wanted to backport a patch that adds localisation messages? (any deployers around?)
[12:48:13] (03PS3) 10Bartosz Dziewoński: Make new topic tool available as opt-out almost everywhere (phase 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801820 (https://phabricator.wikimedia.org/T309368) [12:49:03] !log installing python-virtualenv updates from Buster point release [12:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:26] (03CR) 10Ayounsi: [C: 03+2] netbox: move config for new serveres to role [puppet] - 10https://gerrit.wikimedia.org/r/803488 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [12:51:19] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1009 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [12:51:19] jbond: yeah that's it, thank you [12:51:49] (i think i'll try for the next window) [12:51:54] (03PS5) 10Jbond: netbox: move config for new serveres to role [puppet] - 10https://gerrit.wikimedia.org/r/803488 (https://phabricator.wikimedia.org/T296452) [12:51:56] (03PS6) 10DannyS712: phpcs: move AssignmentInControlStructures exclusion inline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/796360 (https://phabricator.wikimedia.org/T171115) [12:52:02] (03PS3) 10DannyS712: phpcs: move Misleading$wgDebugLogFile exclusion inline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802840 (https://phabricator.wikimedia.org/T171115) [12:52:07] (03PS4) 10DannyS712: phpcs: enable and fix FunctionComment.WrongStyle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802841 (https://phabricator.wikimedia.org/T171115) [12:52:13] (03PS7) 10DannyS712: phpcs: enable and configure ValidGlobalName.allowedPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802842 (https://phabricator.wikimedia.org/T171115) [12:52:22] (03PS4) 10DannyS712: 
phpcs: enable and configure PrefixedGlobalFunctions.allowedPrefix [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802946 (https://phabricator.wikimedia.org/T171115) [12:52:32] (03PS4) 10DannyS712: phpcs: enable and fix MisleadingGlobalNames.Misleading$wgConf [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802947 (https://phabricator.wikimedia.org/T171115) [12:54:13] (03PS1) 10Muehlenhoff: Trim comments [puppet] - 10https://gerrit.wikimedia.org/r/803501 [12:56:03] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10puppet-compiler, 10User-jbond: puppet master command will be removed in puppet 6 - https://phabricator.wikimedia.org/T236373 (10jbond) [12:57:09] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) [12:57:19] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070 (10Papaul) [12:58:49] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:58:59] RECOVERY - SSH on cp5012.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:59:27] (03PS1) 10Ayounsi: Netbox: set correct active server [puppet] - 10https://gerrit.wikimedia.org/r/803505 (https://phabricator.wikimedia.org/T296452) [12:59:56] (03PS1) 10Bluehill395: Show createpage preference only when feature is available [extensions/DiscussionTools] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/802965 (https://phabricator.wikimedia.org/T310053) [12:59:58] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/803505 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, and awight: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220607T1300). [13:00:04] DannyS712: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:06] I'm here : [13:00:12] ) [13:01:00] (03CR) 10Ayounsi: [C: 03+2] Netbox: set correct active server [puppet] - 10https://gerrit.wikimedia.org/r/803505 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [13:01:04] anyone around to deploy? [13:01:40] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:03:42] (03PS3) 10Volans: sre.dns.netbox: update to use new netbox servers [cookbooks] - 10https://gerrit.wikimedia.org/r/803466 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:03:46] (03PS4) 10Volans: sre.dns.netbox: update to use new netbox servers [cookbooks] - 10https://gerrit.wikimedia.org/r/803466 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:03:52] (03PS1) 10Slyngshede: WIP: profile::aptrepo::wikimedia move public apt repo to Apache [puppet] - 10https://gerrit.wikimedia.org/r/803506 [13:04:07] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.003057 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:05:28] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster [13:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:43] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) [13:06:05] PROBLEM - SSH on wtp1039.mgmt 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:06:17] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2009 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [13:10:26] (03PS3) 10Hnowlan: service: image-suggestion state to production [puppet] - 10https://gerrit.wikimedia.org/r/799998 (https://phabricator.wikimedia.org/T304891) [13:10:37] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:11:25] urbanecm RoanKattouw you're signed up as deployers for the current window - are you available? [13:12:31] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:13:07] (03PS1) 10Jbond: P:netbox: Add hosts entry for service address [puppet] - 10https://gerrit.wikimedia.org/r/803508 (https://phabricator.wikimedia.org/T296452) [13:16:23] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:16:58] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [13:17:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:03] (03CR) 10Volans: "See inline for small improvements." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/803465 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:17:11] 10SRE, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Radar): deployment-deploy03 ran out of memory twice while trying to perform a WikiLambda db migration - https://phabricator.wikimedia.org/T309413 (10hashar) [13:17:33] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:19:20] 10SRE, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Radar): deployment-deploy03 ran out of memory twice while trying to perform a WikiLambda db migration - https://phabricator.wikimedia.org/T309413 (10hashar) deployment-parsoid12 went out of memory this morning which I have filed as T310069. It... [13:20:02] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [13:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:30] (03PS1) 10Jbond: netbox-extra: update domain to point to discovery address [dns] - 10https://gerrit.wikimedia.org/r/803511 (https://phabricator.wikimedia.org/T296452) [13:20:38] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [13:20:38] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:51] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [13:20:51] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:20:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:56] (03CR) 10Hnowlan: [C: 03+2] service: image-suggestion state to production 
[puppet] - 10https://gerrit.wikimedia.org/r/799998 (https://phabricator.wikimedia.org/T304891) (owner: 10Hnowlan) [13:21:02] (03CR) 10CI reject: [V: 04-1] netbox-extra: update domain to point to discovery address [dns] - 10https://gerrit.wikimedia.org/r/803511 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:21:12] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [13:21:12] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:21:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:02] (03CR) 10Jbond: netbox: update netbox to use internal discovery address (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/803465 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:27:20] (03PS4) 10Jbond: netbox: update netbox to use internal discovery address [software/spicerack] - 10https://gerrit.wikimedia.org/r/803465 (https://phabricator.wikimedia.org/T296452) [13:28:09] (03PS1) 10Slyngshede: class:apt Add new private repo. [puppet] - 10https://gerrit.wikimedia.org/r/803512 [13:28:36] (03CR) 10Ayounsi: [C: 03+1] netbox-extra: update domain to point to discovery address [dns] - 10https://gerrit.wikimedia.org/r/803511 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:29:04] (03CR) 10CI reject: [V: 04-1] class:apt Add new private repo. [puppet] - 10https://gerrit.wikimedia.org/r/803512 (owner: 10Slyngshede) [13:30:08] (03PS2) 10Slyngshede: class:apt Add new private repo. 
[puppet] - 10https://gerrit.wikimedia.org/r/803512 [13:32:33] (03CR) 10Btullis: [C: 03+2] Use latest image version in all remaining eventgate services [deployment-charts] - 10https://gerrit.wikimedia.org/r/803242 (https://phabricator.wikimedia.org/T306181) (owner: 10Btullis) [13:33:16] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster [13:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:26] (03CR) 10CI reject: [V: 04-1] netbox: update netbox to use internal discovery address [software/spicerack] - 10https://gerrit.wikimedia.org/r/803465 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:36:50] (03Merged) 10jenkins-bot: Use latest image version in all remaining eventgate services [deployment-charts] - 10https://gerrit.wikimedia.org/r/803242 (https://phabricator.wikimedia.org/T306181) (owner: 10Btullis) [13:37:30] (03PS1) 10Jbond: netbox-extra: add discover address as a cname top netbox [dns] - 10https://gerrit.wikimedia.org/r/803514 (https://phabricator.wikimedia.org/T296452) [13:38:02] (03CR) 10CI reject: [V: 04-1] netbox-extra: add discover address as a cname top netbox [dns] - 10https://gerrit.wikimedia.org/r/803514 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:38:28] (03PS1) 10Jbond: cache: add routing for netbox-extra.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/803515 (https://phabricator.wikimedia.org/T296452) [13:39:24] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) [13:40:17] (03Abandoned) 10Jbond: netbox-extra: update domain to point to discovery address [dns] - 10https://gerrit.wikimedia.org/r/803511 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:40:19] (03CR) 10Volans: [C: 03+1] "LGTM, you will need to bypass CI for this. Until the repo is not back public CI would fail." 
[dns] - 10https://gerrit.wikimedia.org/r/803514 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:40:49] (03CR) 10Ayounsi: [C: 03+1] "I'm no expert but it looks coherent to me" [puppet] - 10https://gerrit.wikimedia.org/r/803515 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:40:55] (03CR) 10Herron: "Hi, I noticed that logster-badpass_priv.service and logster-csp.service are currently in failed state on mwlog1002, is that known/expected" [puppet] - 10https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [13:41:23] (03CR) 10Volans: [C: 04-1] "wrong name" [dns] - 10https://gerrit.wikimedia.org/r/803514 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:41:35] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [13:41:46] 10SRE, 10ops-codfw: (Need By:TBD) rack/setup/install row A new PDUs - https://phabricator.wikimedia.org/T309957 (10Papaul) [13:42:13] (03CR) 10Slyngshede: [C: 03+2] Rewrite logster::job to use systemd timers. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [13:42:20] (03CR) 10Volans: [C: 04-1] "wrong name: netbox-exports" [puppet] - 10https://gerrit.wikimedia.org/r/803515 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:42:47] (03PS2) 10Jbond: netbox-extra: add discover address as a cname top netbox [dns] - 10https://gerrit.wikimedia.org/r/803514 (https://phabricator.wikimedia.org/T296452) [13:43:02] (03CR) 10Volans: [C: 03+2] pylint: remove unnecessary comments [software/spicerack] - 10https://gerrit.wikimedia.org/r/803316 (owner: 10Volans) [13:43:13] (03CR) 10Jbond: netbox-extra: add discover address as a cname top netbox (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/803514 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:43:15] (03CR) 10CI reject: [V: 04-1] netbox-extra: add discover address as a cname top netbox [dns] - 10https://gerrit.wikimedia.org/r/803514 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:43:58] (03CR) 10Ayounsi: [C: 03+1] pylint: remove unnecessary comments [software/spicerack] - 10https://gerrit.wikimedia.org/r/803316 (owner: 10Volans) [13:44:13] (03PS2) 10Jbond: cache: add routing for netbox-extra.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/803515 (https://phabricator.wikimedia.org/T296452) [13:45:20] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [13:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:45] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [13:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:55] (03CR) 10Ayounsi: cache: add routing for netbox-extra.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803515 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:46:01] 
(CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:46:09] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [13:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:02] (03PS3) 10Volans: netbox-exports: add exports name as CNAME to netbox [dns] - 10https://gerrit.wikimedia.org/r/803514 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:47:04] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [13:47:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:23] (03CR) 10Volans: [C: 03+1] "LGTM, will need to bypass CI" [dns] - 10https://gerrit.wikimedia.org/r/803514 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:47:31] (03CR) 10CI reject: [V: 04-1] netbox-exports: add exports name as CNAME to netbox [dns] - 10https://gerrit.wikimedia.org/r/803514 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:47:43] (03CR) 10Hnowlan: "This service is now `state: production` and this change can probably be abandoned." 
[puppet] - 10https://gerrit.wikimedia.org/r/802964 (https://phabricator.wikimedia.org/T304891) (owner: 10Jbond) [13:47:57] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [13:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:58] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [13:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:37] (03PS3) 10Jbond: cache: add routing for netbox-extra.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/803515 (https://phabricator.wikimedia.org/T296452) [13:49:50] (03PS4) 10Jbond: netbox-exports: add exports name as CNAME to netbox.discover.wmnet [dns] - 10https://gerrit.wikimedia.org/r/803514 (https://phabricator.wikimedia.org/T296452) [13:50:19] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [13:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:24] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.12 point update - https://phabricator.wikimedia.org/T304546 (10MoritzMuehlenhoff) [13:50:40] (03CR) 10CI reject: [V: 04-1] netbox-exports: add exports name as CNAME to netbox.discover.wmnet [dns] - 10https://gerrit.wikimedia.org/r/803514 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:50:49] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [13:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:54] (03PS4) 10Jbond: cache: add routing for netbox-exports.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/803515 (https://phabricator.wikimedia.org/T296452) [13:51:00] (03Merged) 10jenkins-bot: pylint: remove unnecessary comments [software/spicerack] - 10https://gerrit.wikimedia.org/r/803316 (owner: 10Volans) [13:51:00] !log btullis@deploy1002 helmfile [codfw] 
START helmfile.d/services/eventgate-logging-external: apply [13:51:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:13] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2009 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [13:51:21] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/803515 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:51:25] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1009 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [13:51:41] (03CR) 10Jbond: [V: 03+2 C: 03+2] netbox-exports: add exports name as CNAME to netbox.discover.wmnet [dns] - 10https://gerrit.wikimedia.org/r/803514 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:51:48] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [13:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:15] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [13:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:29] (03CR) 10Ayounsi: netbox-exports: add exports name as CNAME to netbox.discover.wmnet (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/803514 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:53:12] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [13:53:16] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:25] (03PS5) 10Volans: netbox: update netbox to use internal discovery address [software/spicerack] - 10https://gerrit.wikimedia.org/r/803465 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:53:25] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache netbox-exports.discovery.wmnet on all recursors [13:53:29] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox-exports.discovery.wmnet on all recursors [13:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:33] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [13:53:33] (03CR) 10Volans: "reply inline" [software/spicerack] - 10https://gerrit.wikimedia.org/r/803465 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:38] (03PS6) 10Volans: netbox: update netbox to use internal discovery address [software/spicerack] - 10https://gerrit.wikimedia.org/r/803465 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:53:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:07] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [13:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:17] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [13:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:11] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [13:55:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:27] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [13:56:51] (03CR) 10Volans: "reply inline" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [13:57:11] (03CR) 10Jbond: [C: 03+2] cache: add routing for netbox-exports.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/803515 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [13:57:54] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [13:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:38] (03CR) 10Slyngshede: [C: 03+2] Rewrite logster::job to use systemd timers. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/790325 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [13:58:45] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [13:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:36] (03Abandoned) 10Jbond: service: image-suggestion state to production with paging disabled [puppet] - 10https://gerrit.wikimedia.org/r/802964 (https://phabricator.wikimedia.org/T304891) (owner: 10Jbond) [14:00:06] (03PS1) 10Slyngshede: logster::job revert systemd timer migration. 
[puppet] - 10https://gerrit.wikimedia.org/r/803518 [14:00:42] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache netbox-exports.wikimedia.org on all recursors [14:00:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:45] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox-exports.wikimedia.org on all recursors [14:00:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:59] (03PS1) 10Bartosz Dziewoński: Add preference for offering new topic tool when creating new talk pages [extensions/DiscussionTools] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/803527 (https://phabricator.wikimedia.org/T297990) [14:01:47] 10SRE, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Radar): deployment-deploy03 ran out of memory twice while trying to perform a WikiLambda db migration - https://phabricator.wikimedia.org/T309413 (10Zabe) >>! In T309413#7985917, @hashar wrote: > deployment-parsoid12 went out of memory this mo... 
[14:01:53] (03PS1) 10Bartosz Dziewoński: Show createpage preference only when feature is available [extensions/DiscussionTools] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/803528 (https://phabricator.wikimedia.org/T310053) [14:02:08] !log jbond@cumin1001 START - Cookbook sre.dns.wipe-cache netbox-exports.discovery.wmnet on all recursors [14:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:11] !log jbond@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netbox-exports.discovery.wmnet on all recursors [14:02:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:40] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [14:02:41] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [14:02:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:05] (03CR) 10CI reject: [V: 04-1] logster::job revert systemd timer migration. [puppet] - 10https://gerrit.wikimedia.org/r/803518 (owner: 10Slyngshede) [14:03:18] XioNoX: that cookbook needs a spicerack merge and release + merging the cookbook patch [14:04:00] (03CR) 10Volans: [C: 03+2] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/803465 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [14:04:37] (03PS2) 10Slyngshede: logster::job revert systemd timer migration. 
[puppet] - 10https://gerrit.wikimedia.org/r/803518 [14:06:31] RECOVERY - SSH on wtp1039.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:08:33] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [14:09:10] (03CR) 10Slyngshede: "Revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/790325 until I can fix systemd timers correctly." [puppet] - 10https://gerrit.wikimedia.org/r/803518 (owner: 10Slyngshede) [14:11:29] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35773/console" [puppet] - 10https://gerrit.wikimedia.org/r/803508 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [14:11:51] (03Merged) 10jenkins-bot: netbox: update netbox to use internal discovery address [software/spicerack] - 10https://gerrit.wikimedia.org/r/803465 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [14:12:34] (03PS4) 10Volans: netbox: increase TTL to 1D [dns] - 10https://gerrit.wikimedia.org/r/803460 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [14:17:22] 10SRE, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team (Radar): deployment-deploy03 ran out of memory twice while trying to perform a WikiLambda db migration - https://phabricator.wikimedia.org/T309413 (10TheresNoTime) >>! In T309413#7986067, @Zabe wrote: > Could you (or someone else) add me to tha... 
[14:23:49] (03PS1) 10Hnowlan: restbase: add restbase103[123] [puppet] - 10https://gerrit.wikimedia.org/r/803520 [14:23:57] (03PS1) 10Volans: CHANGELOG: add changelogs for release v2.6.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/803521 [14:24:12] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v2.6.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/803521 (owner: 10Volans) [14:30:43] !log jbond@cumin1001 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS buster [14:30:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:24] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v2.6.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/803521 (owner: 10Volans) [14:37:45] (03PS1) 10Volans: Upstream release v2.6.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/803523 [14:38:02] (03CR) 10Volans: [C: 03+2] Upstream release v2.6.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/803523 (owner: 10Volans) [14:38:04] (03CR) 10Herron: [C: 03+1] logster::job revert systemd timer migration. [puppet] - 10https://gerrit.wikimedia.org/r/803518 (owner: 10Slyngshede) [14:40:17] (03CR) 10Slyngshede: [C: 03+2] logster::job revert systemd timer migration. 
[puppet] - 10https://gerrit.wikimedia.org/r/803518 (owner: 10Slyngshede) [14:41:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:41:19] !log jbond@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [14:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:49] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [14:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:42] !log adding additional disk for /srv to webperf2004 T305460 [14:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:45] T305460: Upgrade webperf hosts to Bullseye - https://phabricator.wikimedia.org/T305460 [14:46:06] (03Merged) 10jenkins-bot: Upstream release v2.6.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/803523 (owner: 10Volans) [14:48:07] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons. 
[14:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:01] (03PS2) 10Muehlenhoff: Trim comment [puppet] - 10https://gerrit.wikimedia.org/r/802175 [14:50:18] !log uploaded spicerack_2.6.0 to apt.wikimedia.org bullseye-wikimedia [14:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:08] (03CR) 10Jbond: [C: 03+1] Trim comment [puppet] - 10https://gerrit.wikimedia.org/r/803490 (owner: 10Muehlenhoff) [14:51:13] (03CR) 10Jbond: [C: 03+1] Trim comments [puppet] - 10https://gerrit.wikimedia.org/r/803501 (owner: 10Muehlenhoff) [14:52:08] !log upgrading spicerack to v2.6.0 on cumin2002 [14:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:42] (03CR) 10Muehlenhoff: [C: 03+2] Trim comment [puppet] - 10https://gerrit.wikimedia.org/r/802175 (owner: 10Muehlenhoff) [14:55:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] "Looks like noone is using it after all now, so +1!" [puppet] - 10https://gerrit.wikimedia.org/r/802876 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [14:55:26] (03PS5) 10Volans: sre.dns.netbox: update to use new netbox servers [cookbooks] - 10https://gerrit.wikimedia.org/r/803466 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [14:55:35] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/803466 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [14:55:55] (03CR) 10Klausman: [C: 03+2] ml-services: add ukwiki & wikidatawiki articlequality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/803499 (https://phabricator.wikimedia.org/T307418) (owner: 10Kevin Bazira) [14:56:43] (03CR) 10Alexandros Kosiaris: [C: 03+1] Update cxserver to 2022-05-31-123738-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/801663 (https://phabricator.wikimedia.org/T306963) (owner: 10KartikMistry) [14:56:48] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for 
queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [14:57:27] !log jbond@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS buster [14:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:30] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:00:03] (03PS1) 10Jbond: utils/hooks/pre-commit: only run rake job if new files exist [puppet] - 10https://gerrit.wikimedia.org/r/803524 [15:00:07] (03Merged) 10jenkins-bot: ml-services: add ukwiki & wikidatawiki articlequality isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/803499 (https://phabricator.wikimedia.org/T307418) (owner: 10Kevin Bazira) [15:00:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] scap.cfg.erb: Set release_repo_update_mediawiki_releases_values_cmd [puppet] - 10https://gerrit.wikimedia.org/r/802795 (https://phabricator.wikimedia.org/T299648) (owner: 10Ahmon Dancy) [15:00:28] (03PS2) 10Alexandros Kosiaris: scap.cfg.erb: Set release_repo_update_mediawiki_releases_values_cmd [puppet] - 10https://gerrit.wikimedia.org/r/802795 (https://phabricator.wikimedia.org/T299648) (owner: 10Ahmon Dancy) [15:00:34] (03CR) 10Volans: [V: 03+2 C: 03+2] sre.dns.netbox: update to use new netbox servers [cookbooks] - 10https://gerrit.wikimedia.org/r/803466 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [15:00:42] (03CR) 10Jbond: [C: 03+2] utils/hooks/pre-commit: only run rake job if new files exist [puppet] - 10https://gerrit.wikimedia.org/r/803524 (owner: 10Jbond) [15:00:47] (03CR) 10Jbond: [V: 03+2 C: 03+2] utils/hooks/pre-commit: only run rake job if new files exist [puppet] - 10https://gerrit.wikimedia.org/r/803524 (owner: 10Jbond) [15:01:48] 
(ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [15:02:53] akosiaris: Thanks! [15:03:18] (03PS1) 10Jbond: pre-commit: add SPDX header back [puppet] - 10https://gerrit.wikimedia.org/r/803546 [15:03:31] (03CR) 10Jbond: [V: 03+2 C: 03+2] pre-commit: add SPDX header back [puppet] - 10https://gerrit.wikimedia.org/r/803546 (owner: 10Jbond) [15:03:39] (03CR) 10CI reject: [V: 04-1] pre-commit: add SPDX header back [puppet] - 10https://gerrit.wikimedia.org/r/803546 (owner: 10Jbond) [15:03:44] (03PS2) 10Jbond: pre-commit: add SPDX header back [puppet] - 10https://gerrit.wikimedia.org/r/803546 [15:03:47] (03CR) 10Jbond: [V: 03+2] pre-commit: add SPDX header back [puppet] - 10https://gerrit.wikimedia.org/r/803546 (owner: 10Jbond) [15:05:05] 10SRE, 10Release-Engineering-Team, 10Scap, 10serviceops: Deploy Scap version 4.8.2 - https://phabricator.wikimedia.org/T309116 (10dancy) Thanks @JMeybohm ! 
[15:07:40] (03PS1) 10Muehlenhoff: exim4: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803548 (https://phabricator.wikimedia.org/T308013) [15:07:42] (03PS1) 10Cathal Mooney: Adjust EVPN switch BGP template [homer/public] - 10https://gerrit.wikimedia.org/r/803549 (https://phabricator.wikimedia.org/T302198) [15:08:22] !log volans@cumin1001 START - Cookbook sre.dns.netbox [15:08:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:18] (03CR) 10Muehlenhoff: [C: 03+2] Trim comments [puppet] - 10https://gerrit.wikimedia.org/r/803501 (owner: 10Muehlenhoff) [15:09:25] (03PS2) 10Muehlenhoff: Trim comments [puppet] - 10https://gerrit.wikimedia.org/r/803501 [15:15:16] !log volans@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:15:27] !log volans@cumin1001 START - Cookbook sre.dns.netbox [15:17:17] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803552 (https://phabricator.wikimedia.org/T128546) [15:17:22] !log volans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:20:32] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:41] (03PS2) 10Milimetric: Split up the tables we sqoop [puppet] - 10https://gerrit.wikimedia.org/r/802598 (https://phabricator.wikimedia.org/T309806) [15:20:59] (03CR) 10Milimetric: Split up the tables we sqoop (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/802598 (https://phabricator.wikimedia.org/T309806) (owner: 10Milimetric) [15:21:35] (03PS1) 10Filippo Giunchedi: prometheus: generate per-service TCP blackbox module [puppet] - 10https://gerrit.wikimedia.org/r/803553 (https://phabricator.wikimedia.org/T305847) [15:21:37] (03PS1) 10Filippo Giunchedi: hieradata: set SNI for ldap-ro [puppet] - 10https://gerrit.wikimedia.org/r/803554 
(https://phabricator.wikimedia.org/T305847) [15:23:09] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/image-suggestion: sync [15:23:32] (03CR) 10CI reject: [V: 04-1] Split up the tables we sqoop [puppet] - 10https://gerrit.wikimedia.org/r/802598 (https://phabricator.wikimedia.org/T309806) (owner: 10Milimetric) [15:23:34] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: sync [15:25:35] (03CR) 10CI reject: [V: 04-1] prometheus: generate per-service TCP blackbox module [puppet] - 10https://gerrit.wikimedia.org/r/803553 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [15:27:09] (03PS3) 10Cathal Mooney: Add cloudsw1-e4 and cloudsw1-f4 to mgmt and adjust existing cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/802499 (https://phabricator.wikimedia.org/T304989) [15:27:17] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35776/console" [puppet] - 10https://gerrit.wikimedia.org/r/803468 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [15:27:53] (03PS9) 10Ayounsi: Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 [15:29:11] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/803468 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [15:30:06] (03PS2) 10Filippo Giunchedi: prometheus: generate per-service TCP blackbox module [puppet] - 10https://gerrit.wikimedia.org/r/803553 (https://phabricator.wikimedia.org/T305847) [15:30:08] (03PS2) 10Filippo Giunchedi: hieradata: set SNI for ldap-ro [puppet] - 10https://gerrit.wikimedia.org/r/803554 (https://phabricator.wikimedia.org/T305847) [15:35:33] (03CR) 10Ayounsi: [C: 03+1] "Got caught then missed, see:" [homer/public] - 10https://gerrit.wikimedia.org/r/803549 (https://phabricator.wikimedia.org/T302198) (owner: 10Cathal Mooney) [15:37:41] (03CR) 10Jbond: [V: 03+1 C: 03+2] 
netbox: move api url to discovery domain name [puppet] - 10https://gerrit.wikimedia.org/r/803468 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [15:38:06] (03CR) 10Cathal Mooney: [C: 03+2] Adjust EVPN switch BGP template (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/803549 (https://phabricator.wikimedia.org/T302198) (owner: 10Cathal Mooney) [15:38:32] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:38:46] (03Merged) 10jenkins-bot: Adjust EVPN switch BGP template [homer/public] - 10https://gerrit.wikimedia.org/r/803549 (https://phabricator.wikimedia.org/T302198) (owner: 10Cathal Mooney) [15:40:32] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2009 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [15:41:24] PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:41:46] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35780/console" [puppet] - 10https://gerrit.wikimedia.org/r/803554 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [15:42:32] PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:45:15] (03Abandoned) 10Ayounsi: Enable Icinga notifications on new netbox frontends [puppet] - 
10https://gerrit.wikimedia.org/r/803485 (https://phabricator.wikimedia.org/T296452) (owner: 10Ayounsi) [15:45:20] (03CR) 10Jbond: [C: 03+1] exim4: Assign SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803548 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [15:50:05] jouncebot: nowandnext [15:50:05] No deployments scheduled for the next 0 hour(s) and 9 minute(s) [15:50:05] In 0 hour(s) and 9 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220607T1600) [15:54:14] PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:54:14] PROBLEM - Check unit status of netbox_ganeti_codfw_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:54:14] PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:57:30] PROBLEM - Check unit status of netbox_ganeti_ulsfo_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:57:52] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 192 probes of 679 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:58:24] PROBLEM - Check unit status of netbox_ganeti_esams_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams_sync 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [15:59:40] PROBLEM - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:00:05] jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220607T1600). [16:00:06] Mitar: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:42] in a meeting for 30m, sorry -- I can deploy then, unless jbond can get to it first [16:00:44] (03PS2) 10Filippo Giunchedi: Run isort/black on the codebase [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801642 (https://phabricator.wikimedia.org/T309546) [16:00:46] (03PS2) 10Filippo Giunchedi: tox: add formattercheck [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801643 (https://phabricator.wikimedia.org/T309546) [16:00:48] (03PS2) 10Filippo Giunchedi: Use etcdmirror namespace for metrics [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801644 (https://phabricator.wikimedia.org/T309546) [16:00:50] (03PS2) 10Filippo Giunchedi: Export lag as a Gauge metric [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801645 (https://phabricator.wikimedia.org/T309546) [16:00:55] (03CR) 10Filippo Giunchedi: Run isort/black on the codebase (031 comment) [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801642 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [16:03:13] (03CR) 10Jbond: [V: 03+1 C: 03+2] C:puppetmaster: Add requestctl validate to the private repo pre-commit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803324 (owner: 10Jbond) [16:03:15] (03PS1) 10Jbond: puppetmaster: update private repo 
pre-commit to error un-staged [puppet] - 10https://gerrit.wikimedia.org/r/803560 [16:04:10] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 60 probes of 679 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:04:38] (03PS2) 10Jbond: puppetmaster: update private repo pre-commit to error un-staged [puppet] - 10https://gerrit.wikimedia.org/r/803560 [16:04:40] (03PS1) 10Dduvall: testwikis wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803561 (https://phabricator.wikimedia.org/T308068) [16:04:42] (03CR) 10Dduvall: [C: 03+2] testwikis wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803561 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall) [16:05:12] (03CR) 10CI reject: [V: 04-1] puppetmaster: update private repo pre-commit to error un-staged [puppet] - 10https://gerrit.wikimedia.org/r/803560 (owner: 10Jbond) [16:05:56] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803561 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall) [16:06:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:55] !log dduvall@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.15 refs T308068 [16:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:59] T308068: 1.39.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T308068 [16:08:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:08:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[16:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:06] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [16:11:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:23] (03PS1) 10Jbond: utils: hooks pre-push [puppet] - 10https://gerrit.wikimedia.org/r/803562 [16:13:29] rzl: yes, I'll get it [16:13:31] (03CR) 10Filippo Giunchedi: Use etcdmirror namespace for metrics (031 comment) [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801644 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [16:14:04] dancy: yw :-) [16:14:31] thanks akosiaris [16:15:42] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:16:13] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:17:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:29] (03PS10) 10Ayounsi: Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 [16:18:42] (03PS4) 10Cathal
Mooney: Add cloudsw1-e4 and cloudsw1-f4 to mgmt and adjust existing cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/802499 (https://phabricator.wikimedia.org/T304989) [16:19:01] (03CR) 10Ayounsi: Initial support for servers switch interfaces (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 (owner: 10Ayounsi) [16:20:59] (03CR) 10Cathal Mooney: [C: 03+2] Add cloudsw1-e4 and cloudsw1-f4 to mgmt and adjust existing cloudsw [puppet] - 10https://gerrit.wikimedia.org/r/802499 (https://phabricator.wikimedia.org/T304989) (owner: 10Cathal Mooney) [16:21:22] !log dduvall@deploy1002 scap failed: average error rate on 8/8 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details) [16:21:23] !log dduvall@deploy1002 scap failed: RuntimeError scap failed: average error rate on 8/8 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details) (duration: 14m 27s) [16:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:15] (03CR) 10Jbond: "can we please get a +1 from Ariel before progressing, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/802921 (https://phabricator.wikimedia.org/T301104) (owner: 10Mitar) [16:28:31] (03CR) 10Jbond: [C: 03+2] utils: hooks pre-push [puppet] - 10https://gerrit.wikimedia.org/r/803562 (owner: 10Jbond) [16:31:53] (03PS1) 10Andrew Bogott: Openstack Keystone: support creation of additional domains [puppet] - 10https://gerrit.wikimedia.org/r/803567 (https://phabricator.wikimedia.org/T280792) [16:31:55] (03PS1) 10Andrew Bogott: Keystone: support config for arbitrary sql-based service domains [puppet] - 10https://gerrit.wikimedia.org/r/803568 (https://phabricator.wikimedia.org/T280792) [16:32:07] !log scap deploy-promote testwikis failed at invocation of logstash_checker.py ("logstash_checker.py: 
error: argument --delay: invalid int value: '40.406498670578'") T308068 [16:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:12] (03PS1) 10Zabe: kubeadm: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803569 (https://phabricator.wikimedia.org/T308013) [16:32:12] T308068: 1.39.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T308068 [16:32:14] (03PS1) 10Zabe: kibana: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803570 (https://phabricator.wikimedia.org/T308013) [16:32:16] (03PS1) 10Zabe: kerberos: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803571 (https://phabricator.wikimedia.org/T308013) [16:32:18] (03PS1) 10Zabe: keepalived: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803572 (https://phabricator.wikimedia.org/T308013) [16:32:20] (03PS1) 10Zabe: karapace: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803573 (https://phabricator.wikimedia.org/T308013) [16:32:22] (03PS1) 10Zabe: kafkatee: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803574 (https://phabricator.wikimedia.org/T308013) [16:32:24] (03PS1) 10Zabe: initramfs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803575 (https://phabricator.wikimedia.org/T308013) [16:32:26] (03PS1) 10Zabe: imagecatalog: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803576 (https://phabricator.wikimedia.org/T308013) [16:32:30] (03PS1) 10Zabe: httpd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803577 (https://phabricator.wikimedia.org/T308013) [16:32:32] (03PS2) 10Muehlenhoff: Enable ganeti4004 as Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/792670 [16:33:00] (03CR) 10CI reject: [V: 04-1] Openstack Keystone: support creation of additional domains [puppet] - 10https://gerrit.wikimedia.org/r/803567 (https://phabricator.wikimedia.org/T280792) (owner: 10Andrew Bogott) [16:33:06] (03CR) 10Muehlenhoff: [C: 03+1] sre: update renamed 
otrs role to vrts [puppet] - 10https://gerrit.wikimedia.org/r/802579 (https://phabricator.wikimedia.org/T293942) (owner: 10Dzahn) [16:34:12] thanks jbond <3 [16:37:01] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging" [puppet] - 10https://gerrit.wikimedia.org/r/803573 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [16:37:07] (03PS2) 10Muehlenhoff: karapace: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803573 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [16:37:23] (03PS2) 10Andrew Bogott: Openstack Keystone: support creation of additional domains [puppet] - 10https://gerrit.wikimedia.org/r/803567 (https://phabricator.wikimedia.org/T280792) [16:37:25] (03PS2) 10Andrew Bogott: Keystone: support config for arbitrary sql-based service domains [puppet] - 10https://gerrit.wikimedia.org/r/803568 (https://phabricator.wikimedia.org/T280792) [16:38:14] (03PS1) 10Jbond: P:netbox: add REQUESTS_CA_BUNDLE variable to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/803579 (https://phabricator.wikimedia.org/T296452) [16:38:30] PROBLEM - Check systemd state on netbox2002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_dump_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:39:45] (03PS3) 10Andrew Bogott: Keystone: Include config for 'magnum' service domain in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/803568 (https://phabricator.wikimedia.org/T280792) [16:40:01] (03PS2) 10Volans: P:netbox: add REQUESTS_CA_BUNDLE variable to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/803579 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [16:41:17] (03CR) 10RLazarus: [C: 03+2] httpbb: Add basic tests for query_service (WDQS) [puppet] - 10https://gerrit.wikimedia.org/r/802079 (owner: 10Lucas Werkmeister (WMDE)) [16:41:24] (03PS3) 10Volans: P:netbox: add REQUESTS_CA_BUNDLE variable to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/803579 
(https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [16:41:40] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/803579 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [16:41:42] (03CR) 10RLazarus: [C: 03+2] httpbb: Add basic tests for query_service (WDQS) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/802079 (owner: 10Lucas Werkmeister (WMDE)) [16:43:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:53] (03CR) 10Muehlenhoff: [C: 03+2] "Thanks! Merging" [puppet] - 10https://gerrit.wikimedia.org/r/803575 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [16:43:59] (03PS2) 10Muehlenhoff: initramfs: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803575 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [16:45:30] ^ FYI I missed something in 802079 and broke puppet on cumin*,deploy*, fixing now [16:45:52] (03PS4) 10Jbond: P:netbox: add REQUESTS_CA_BUNDLE variable to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/803579 (https://phabricator.wikimedia.org/T296452) [16:45:54] (03PS4) 10Andrew Bogott: Keystone: Include config for 'magnum' service domain in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/803568 (https://phabricator.wikimedia.org/T280792) [16:46:32] rzl: ack, let me know if you need a hand [16:46:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35785/console" [puppet] - 10https://gerrit.wikimedia.org/r/803579 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [16:46:48] nah, easy one, but thanks [16:46:57] cool, they're the best :) [16:47:35] (03PS1) 10RLazarus: httpbb: Add missing query_service directory [puppet] - 10https://gerrit.wikimedia.org/r/803581 [16:48:22] (03PS3) 10Jbond: puppetmaster: update private repo pre-commit to
error un-staged [puppet] - 10https://gerrit.wikimedia.org/r/803560 [16:48:28] (03PS2) 10Muehlenhoff: imagecatalog: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803576 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [16:48:38] (03CR) 10Ayounsi: [C: 03+1] P:netbox: add REQUESTS_CA_BUNDLE variable to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/803579 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [16:49:46] (03CR) 10RLazarus: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35787/console" [puppet] - 10https://gerrit.wikimedia.org/r/803581 (owner: 10RLazarus) [16:50:02] (03CR) 10Jbond: [V: 03+1 C: 03+2] P:netbox: add REQUESTS_CA_BUNDLE variable to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/803579 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [16:50:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:50:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:50:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:50:41] (03CR) 10RLazarus: [V: 03+1 C: 03+2] "Just FYI, no action needed -- my fault for missing this in review :)" [puppet] - 10https://gerrit.wikimedia.org/r/803581 (owner: 10RLazarus) [16:52:56] RECOVERY - Check unit status of netbox_ganeti_ulsfo_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:53:40] RECOVERY - Check unit status of netbox_ganeti_esams_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:54:52] RECOVERY - Check unit status of netbox_ganeti_eqiad_sync on netbox1002 is OK: OK: 
Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:54:53] (03CR) 10Lucas Werkmeister (WMDE): httpbb: Add missing query_service directory (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803581 (owner: 10RLazarus) [16:55:27] (03PS1) 10Jbond: netbox: also add CA bundle environment to netbox_dump_run job [puppet] - 10https://gerrit.wikimedia.org/r/803582 (https://phabricator.wikimedia.org/T296452) [16:55:40] (03CR) 10Jbond: [C: 03+2] netbox: also add CA bundle environment to netbox_dump_run job [puppet] - 10https://gerrit.wikimedia.org/r/803582 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [16:55:58] PROBLEM - BGP status on cloudsw2-d5-eqiad.mgmt is CRITICAL: BGP CRITICAL - The requested table is empty or does not exist https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:56:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:19] fixed 👍 [16:57:33] (03CR) 10Jbond: [C: 03+2] kubeadm: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803569 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [16:59:00] (03PS2) 10Jbond: kubeadm: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803569 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [16:59:14] RECOVERY - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:00:05] (03CR) 10Zabe: [C: 03+1] "Thanks, missed that MIT license." 
[puppet] - 10https://gerrit.wikimedia.org/r/803569 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [17:00:08] (03PS1) 10Dduvall: logstash_checker.py: Change `--delay` argument type to float [puppet] - 10https://gerrit.wikimedia.org/r/803583 (https://phabricator.wikimedia.org/T308068) [17:00:22] RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:00:48] RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:00:48] RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:00:48] RECOVERY - Check unit status of netbox_ganeti_codfw_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [17:00:52] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 115 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [17:01:27] (03CR) 10Jbond: [C: 04-1] "Just going to -01 this until i can check with moritz what to do about the GPL file" [puppet] - 10https://gerrit.wikimedia.org/r/803569 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [17:01:54] (03PS2) 10Jbond: kibana: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803570 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [17:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - 
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:01:59] (03CR) 10Jbond: [C: 03+2] kibana: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803570 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [17:02:06] (03PS2) 10Jbond: kerberos: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803571 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [17:02:20] (03CR) 10Jbond: [C: 03+2] kerberos: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803571 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [17:02:29] (03PS2) 10Jbond: keepalived: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803572 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [17:02:41] (03CR) 10Jbond: [C: 03+2] keepalived: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803572 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [17:03:09] (03CR) 10Jbond: [C: 03+2] kafkatee: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803574 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [17:03:18] (03PS2) 10Jbond: kafkatee: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803574 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [17:03:22] (03CR) 10Jbond: [V: 03+2] kafkatee: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803574 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [17:04:20] (03CR) 10Jbond: [C: 03+2] httpd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803577 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [17:04:24] (03CR) 10Ahmon Dancy: [C: 03+1] logstash_checker.py: Change `--delay` argument type to float [puppet] - 10https://gerrit.wikimedia.org/r/803583 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall) [17:04:42] (03CR) 10Jbond: [C: 03+2] imagecatalog: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803576 (https://phabricator.wikimedia.org/T308013) (owner: 
10Zabe) [17:05:01] (03PS2) 10Jbond: httpd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803577 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [17:05:06] (03CR) 10Jbond: [V: 03+2] httpd: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803577 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [17:05:57] zabe: thanks again for all the SPDX patches [17:06:06] (03CR) 10RLazarus: [C: 03+1] "LGTM! Thanks for adding the tests, and for your patience. Let me know if it's ready to merge from your end, and I'll get it taken care of." [puppet] - 10https://gerrit.wikimedia.org/r/799297 (https://phabricator.wikimedia.org/T289243) (owner: 10Lucas Werkmeister (WMDE)) [17:06:21] i did not schedule this in the window, but i just so happen to have a puppet patch that will unblock train https://gerrit.wikimedia.org/r/c/operations/puppet/+/803583 [17:06:52] anyone able to pick this up? it's a fairly simple change to `logstash_checker.py` fwiw [17:07:23] dduvall: lgtm will merge [17:07:33] (03CR) 10Jbond: [C: 03+2] logstash_checker.py: Change `--delay` argument type to float [puppet] - 10https://gerrit.wikimedia.org/r/803583 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall) [17:07:40] awesome! thank you very much [17:07:55] 10SRE-OnFire (FY2021/2022-Q4), 10observability, 10SRE Observability (FY2021/2022-Q4): Make 'status page' dashboard the default dashboard in Grafana - https://phabricator.wikimedia.org/T305954 (10herron) 05Open→03Resolved a:03herron >>! In T305954#7855342, @herron wrote: > https://grafana-rw.wikimedia.o... [17:07:59] 10SRE, 10SRE-OnFire (FY2021/2022-Q4), 10Observability-Alerting: Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10herron) [17:08:06] np dduvall, where does it need deploying? [17:08:29] oh right. sorry.
deploy1002 [17:08:34] yw [17:09:18] (03CR) 10Lucas Werkmeister (WMDE): query_service: don’t cache index files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799297 (https://phabricator.wikimedia.org/T289243) (owner: 10Lucas Werkmeister (WMDE)) [17:10:32] dduvall: deployed now [17:10:36] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:10:41] thanks again [17:11:44] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1009 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [17:13:04] !log dduvall@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.15 refs T308068 [17:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:09] T308068: 1.39.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T308068 [17:13:33] (03CR) 10Muehlenhoff: kubeadm: Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803569 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [17:13:41] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/802543 (owner: 10Muehlenhoff) [17:16:14] (03PS3) 10Jbond: kubeadm: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803569 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [17:17:59] (03CR) 10Jbond: [C: 03+2] kubeadm: Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/803569 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [17:20:07] (03PS1) 10Bartosz Dziewoński: Disable (instead of hiding) preferences that would have no effect [extensions/DiscussionTools] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/803533 [17:20:36] (03PS2) 10Bartosz 
Dziewoński: Add preference for offering new topic tool when creating new talk pages [extensions/DiscussionTools] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/803527 (https://phabricator.wikimedia.org/T297990) [17:20:44] (03PS2) 10Bartosz Dziewoński: Show createpage preference only when feature is available [extensions/DiscussionTools] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/803528 (https://phabricator.wikimedia.org/T310053) [17:21:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM webperf2004.codfw.wmnet [17:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:50] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2009 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [17:22:14] PROBLEM - BGP status on cloudsw2-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - The requested table is empty or does not exist https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:22:45] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (10Smalyshev) [17:23:35] (03PS1) 10Volans: sre.hosts.dhcp: fix usage example [cookbooks] - 10https://gerrit.wikimedia.org/r/803584 [17:28:12] (03CR) 10Volans: [C: 03+2] sre.hosts.dhcp: fix usage example [cookbooks] - 10https://gerrit.wikimedia.org/r/803584 (owner: 10Volans) [17:29:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM webperf2004.codfw.wmnet [17:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:28] (03PS1) 10Muehlenhoff: Add to Stanislav Malyshev to contributors [puppet] - 10https://gerrit.wikimedia.org/r/803585 (https://phabricator.wikimedia.org/T308013) 
[17:30:18] (03PS1) 10Cwhite: logstash: canary curator fork on codfw [puppet] - 10https://gerrit.wikimedia.org/r/803586 (https://phabricator.wikimedia.org/T301017) [17:30:20] (03PS1) 10Cwhite: opensearch: ensure elasticsearch-curator on opensearch compatible fork [puppet] - 10https://gerrit.wikimedia.org/r/803587 (https://phabricator.wikimedia.org/T301017) [17:30:22] (03PS1) 10Cwhite: opensearch: disable compatibility mode [puppet] - 10https://gerrit.wikimedia.org/r/803588 (https://phabricator.wikimedia.org/T301017) [17:30:30] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [17:31:14] (03Merged) 10jenkins-bot: sre.hosts.dhcp: fix usage example [cookbooks] - 10https://gerrit.wikimedia.org/r/803584 (owner: 10Volans) [17:32:12] (03CR) 10Muehlenhoff: [C: 03+2] Add to Stanislav Malyshev to contributors [puppet] - 10https://gerrit.wikimedia.org/r/803585 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [17:36:23] !log pt1979@cumin1001 START - Cookbook sre.hosts.dhcp for host clouddumps1001.wikimedia.org [17:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:41] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host clouddumps1001.wikimedia.org [17:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:38] !log pt1979@cumin1001 START - Cookbook sre.hosts.dhcp for host clouddumps1001.wikimedia.org [17:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:13] jouncebot: next [17:40:13] In 0 hour(s) and 19 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220607T1800) [17:40:46] (03PS6) 10RLazarus: query_service: don’t cache index files [puppet] - 10https://gerrit.wikimedia.org/r/799297 
(https://phabricator.wikimedia.org/T289243) (owner: 10Lucas Werkmeister (WMDE)) [17:43:26] !log dduvall@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.15 refs T308068 (duration: 30m 22s) [17:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:30] T308068: 1.39.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T308068 [17:44:16] (03CR) 10RLazarus: [C: 03+2] query_service: don’t cache index files [puppet] - 10https://gerrit.wikimedia.org/r/799297 (https://phabricator.wikimedia.org/T289243) (owner: 10Lucas Werkmeister (WMDE)) [17:45:22] !log dduvall@deploy1002 Pruned MediaWiki: 1.39.0-wmf.13 (duration: 01m 49s) [17:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:42] (03CR) 10Andrew Bogott: [C: 03+2] Add tenacity lib and retry logic [puppet] - 10https://gerrit.wikimedia.org/r/803340 (owner: 10Nskaggs) [17:50:41] (03PS1) 10Slyngshede: logster::job migrate cron to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) [17:51:40] (03CR) 10CI reject: [V: 04-1] logster::job migrate cron to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [17:51:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:15] (03PS2) 10Slyngshede: logster::job migrate cron to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) [17:58:12] (03CR) 10RLazarus: [C: 03+2] query_service: don’t cache index files (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/799297 (https://phabricator.wikimedia.org/T289243) (owner: 10Lucas Werkmeister (WMDE)) [17:58:21] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:58:22] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:24] 10SRE, 10ops-eqiad, 10serviceops: mw1415 (canary appserver) is down, incl. mgmt - https://phabricator.wikimedia.org/T307755 (10Cmjohnson) Dell tech should be here tomorrow or Thursday to fix. [17:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:12] (03CR) 10CI reject: [V: 04-1] logster::job migrate cron to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [18:00:04] dduvall and jeena: My dear minions, it's time we take the moon! Just kidding. Time for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220607T1800). [18:00:32] (03PS3) 10Slyngshede: logster::job migrate cron to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) [18:01:18] (03PS1) 10Dduvall: group0 wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803591 (https://phabricator.wikimedia.org/T308068) [18:01:22] (03CR) 10Dduvall: [C: 03+2] group0 wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803591 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall) [18:01:39] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Cmjohnson) [18:01:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:54] (03CR) 10CI reject: [V: 04-1] logster::job migrate cron to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [18:02:08] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 116 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [18:02:42] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/803591 (https://phabricator.wikimedia.org/T308068) (owner: 10Dduvall) [18:03:11] (03PS4) 10Slyngshede: logster::job migrate cron to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) [18:04:15] (03CR) 10jenkins-bot: logster::job migrate cron to systemd timer. 
[puppet] - 10https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [18:04:55] (03CR) 10Nskaggs: Add tenacity lib and retry logic (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/803340 (owner: 10Nskaggs) [18:05:16] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: Q4:(Need By: TBD) rack/setup/install aqs1016-aqs1021 - https://phabricator.wikimedia.org/T305570 (10Cmjohnson) [18:05:19] (03PS5) 10Slyngshede: logster::job migrate cron to systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) [18:06:30] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.15 refs T308068 [18:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:06:33] T308068: 1.39.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T308068 [18:06:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:13] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/35788/console" [puppet] - 10https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [18:21:11] (03CR) 10JMeybohm: [C: 03+1] "Looks good!" 
[software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801645 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [18:21:35] (03CR) 10JMeybohm: [C: 03+1] Run isort/black on the codebase [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801642 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [18:22:43] (03CR) 10JMeybohm: [C: 03+1] Use etcdmirror namespace for metrics (031 comment) [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/801644 (https://phabricator.wikimedia.org/T309546) (owner: 10Filippo Giunchedi) [18:26:16] (03CR) 10Slyngshede: [V: 03+1] "Reworked the systemd timer replacement for logster. The original version didn't deal to well with complex option sets on the logster comma" [puppet] - 10https://gerrit.wikimedia.org/r/803590 (https://phabricator.wikimedia.org/T273673) (owner: 10Slyngshede) [18:33:02] RECOVERY - Check systemd state on netbox2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:33:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:17] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Keystone: support creation of additional domains [puppet] - 10https://gerrit.wikimedia.org/r/803567 (https://phabricator.wikimedia.org/T280792) (owner: 10Andrew Bogott) [18:35:22] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: Include config for 'magnum' service domain in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/803568 (https://phabricator.wikimedia.org/T280792) (owner: 10Andrew Bogott) [18:40:26] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [18:40:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] 
DONE helmfile.d/services/mwdebug: apply [18:40:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:16] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1004-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:46:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:33] (03PS1) 10Andrew Bogott: OpenStack Magnum: use a service user 'magnum' as admin of 'magnum' domain [puppet] - 10https://gerrit.wikimedia.org/r/803593 (https://phabricator.wikimedia.org/T280792) [18:51:10] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Magnum: use a service user 'magnum' as admin of 'magnum' domain [puppet] - 10https://gerrit.wikimedia.org/r/803593 (https://phabricator.wikimedia.org/T280792) (owner: 10Andrew Bogott) [18:51:59] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ores-admin for ml-team-admins - https://phabricator.wikimedia.org/T310044 (10CDanis) a:03calbon [18:59:16] (03PS1) 10Andrew Bogott: magnum policy.yaml: fix admin_or_projectadmin rules [puppet] - 10https://gerrit.wikimedia.org/r/803595 [19:01:00] (03CR) 10Andrew Bogott: [C: 03+2] magnum policy.yaml: fix admin_or_projectadmin rules [puppet] - 10https://gerrit.wikimedia.org/r/803595 (owner: 10Andrew Bogott) [19:13:24] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:13:52] 
(03PS4) 10Ayounsi: Add python3.10 support to Tox [cookbooks] - 10https://gerrit.wikimedia.org/r/803263 [19:13:54] (03PS11) 10Ayounsi: Initial support for servers switch interfaces [cookbooks] - 10https://gerrit.wikimedia.org/r/803261 [19:15:46] PROBLEM - puppet last run on thumbor2006 is CRITICAL: CRITICAL: Puppet has been disabled for 605154 seconds, message: foo, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [19:15:56] 🤔 [19:20:56] PROBLEM - Router interfaces on cr3-eqsin is CRITICAL: CRITICAL: host 103.102.166.131, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:22:02] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1009 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [19:22:14] PROBLEM - mediawiki originals uploads -hourly- for codfw on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2009 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [19:23:04] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:33:24] (03PS1) 10Andrew Bogott: Use name rather than ID for magnum domain conf [puppet] - 10https://gerrit.wikimedia.org/r/803596 [19:34:33] (03CR) 10Andrew Bogott: [C: 03+2] Use name rather than ID for magnum domain conf [puppet] - 10https://gerrit.wikimedia.org/r/803596 (owner: 10Andrew Bogott) [19:36:01] 
10ops-eqiad, 10Cloud-Services, 10DC-Ops, 10User-dcaro: move cloudcephmon1002.eqiad.wmnet from rack B4 to rack D5 - https://phabricator.wikimedia.org/T304096 (10nskaggs) p:05Triage→03Low [19:41:37] (03CR) 10Cwhite: "LGTM overall" [alerts] - 10https://gerrit.wikimedia.org/r/803368 (https://phabricator.wikimedia.org/T300723) (owner: 10BCornwall) [19:46:28] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/803553 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [19:47:52] I'll just sling out a couple of config patches a few minutes early rather than hold up the deploy window. [19:47:58] (03CR) 10Jforrester: [C: 03+2] extdist: 1.38 is now stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802612 (owner: 10MacFan4000) [19:49:24] (03CR) 10Cwhite: "The curator fork has been running on beta for a couple weeks now and seems to be doing the right thing. I propose we run it on codfw (sta" [puppet] - 10https://gerrit.wikimedia.org/r/803586 (https://phabricator.wikimedia.org/T301017) (owner: 10Cwhite) [19:50:15] yeah, sorry, i really filled up the backport window all by myself [19:50:24] MatmaRex: It's fine. :-) [19:50:43] it'd be great if anyone wanted to do me a favor and +2 some patches early [19:50:52] i also want to backport one with a new localisation message :/ [19:51:15] MatmaRex: Oy. I've got a meeting so I shouldn't… [19:51:29] (03PS3) 10Jforrester: extdist: 1.38 is now stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802612 (owner: 10MacFan4000) [19:51:39] (03CR) 10Jforrester: "New improved gerrit, huh?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802612 (owner: 10MacFan4000) [19:52:30] (03Merged) 10jenkins-bot: extdist: 1.38 is now stable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802612 (owner: 10MacFan4000) [19:52:34] Finally. 
[19:52:49] (03PS2) 10Jforrester: extdist: Drop 1.36, now EOL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802772 (https://phabricator.wikimedia.org/T309864) [19:52:52] (03CR) 10Jforrester: [C: 03+2] extdist: Drop 1.36, now EOL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802772 (https://phabricator.wikimedia.org/T309864) (owner: 10Jforrester) [19:53:35] (03Merged) 10jenkins-bot: extdist: Drop 1.36, now EOL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/802772 (https://phabricator.wikimedia.org/T309864) (owner: 10Jforrester) [19:54:30] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [19:56:43] (03PS1) 10Ahmon Dancy: mediawiki chart 0.2.3: Add random chars to name of test job [deployment-charts] - 10https://gerrit.wikimedia.org/r/803597 [19:56:52] * James_F sighs at scap. [19:57:01] !log jforrester@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:802612|extdist: 1.38 is now stable]] (duration: 03m 44s) [19:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:57] James_F: What's up? [20:00:05] RoanKattouw, Urbanecm, and cjming: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220607T2000). [20:00:05] MatmaRex, jan_drewniak, and James_F: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:25] I can deploy today [20:00:25] dancy: The fpm-restart status update keeps updating backwards, jumping just now from 84% done to 77% done then 88%. [20:00:32] (03CR) 10JHathaway: [C: 03+1] "looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/803548 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [20:00:37] !log jforrester@deploy1002 Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:802772|extdist: Drop 1.36, now EOL (T309864)]] (duration: 03m 26s) [20:00:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:41] T309864: Tidy up references to REL1_36 now it is EOL - https://phabricator.wikimedia.org/T309864 [20:00:43] urbanecm: Scap's just done, I'm clear, good luck. [20:00:47] thanks [20:00:49] hi [20:01:01] yeah, scap sync-file takes 3min+ those days, unfortunately :/ [20:01:07] hello MatmaRex [20:01:12] urbanecm: one of my patches has a new localisation message and will need scap (right?), is that okay? [20:01:13] So much for 40s deploys. [20:01:30] It'll need scap-world, I presume you mean. [20:01:38] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/803527/2/i18n/en.json [20:01:57] thanks, i forget what is the name of that these days [20:02:50] MatmaRex: it looks your changes don't depend on each other, and it looks it should be possible to just sync-world all of your backports at once. 
[20:02:56] (in that case, that'd be fine) [20:03:07] (03CR) 10Urbanecm: [C: 03+2] Show createpage preference only when feature is available [extensions/DiscussionTools] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/802965 (https://phabricator.wikimedia.org/T310053) (owner: 10Bluehill395) [20:03:12] (03CR) 10Urbanecm: [C: 03+2] Add CSS class 'mw-htmlform-checkradio-indent' for indenting form fields [core] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/803526 (owner: 10Bartosz Dziewoński) [20:03:16] (03CR) 10Urbanecm: [C: 03+2] Disable (instead of hiding) preferences that would have no effect [extensions/DiscussionTools] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/803533 (owner: 10Bartosz Dziewoński) [20:03:20] thanks [20:03:35] (03CR) 10Urbanecm: [C: 03+2] Show createpage preference only when feature is available [extensions/DiscussionTools] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/803528 (https://phabricator.wikimedia.org/T310053) (owner: 10Bartosz Dziewoński) [20:03:42] (03CR) 10Urbanecm: [C: 03+2] Add preference for offering new topic tool when creating new talk pages [extensions/DiscussionTools] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/803527 (https://phabricator.wikimedia.org/T297990) (owner: 10Bartosz Dziewoński) [20:04:01] note, one patch in wmf.15 and others in wmf.14, and one patch in core and others in DiscussionTools (and one config at the end). i've got a bit of a mess today [20:04:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:04:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:04:27] MatmaRex: so long you don't mind testing all of those at once, it should be fine :) [20:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:34] yeah, i'd in fact prefer that :) [20:04:39] even better! 
[20:04:58] does the config depend on backports? or can we do it independently? [20:05:14] it depends on the backports [20:06:09] okay, good to know [20:06:33] in that case, I'll let you know once it's all ready for testing [20:08:03] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:07] jan_drewniak: hi, are you around for your deployment? [20:09:47] (03Merged) 10jenkins-bot: Show createpage preference only when feature is available [extensions/DiscussionTools] (wmf/1.39.0-wmf.15) - 10https://gerrit.wikimedia.org/r/802965 (https://phabricator.wikimedia.org/T310053) (owner: 10Bluehill395) [20:12:40] RECOVERY - BGP status on cr3-eqsin is OK: BGP OK - up: 326, down: 22, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:13:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:14:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:17:31] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[20:20:01] (03Merged) 10jenkins-bot: Add CSS class 'mw-htmlform-checkradio-indent' for indenting form fields [core] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/803526 (owner: 10Bartosz Dziewoński) [20:20:05] (03Merged) 10jenkins-bot: Disable (instead of hiding) preferences that would have no effect [extensions/DiscussionTools] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/803533 (owner: 10Bartosz Dziewoński) [20:20:09] (03Merged) 10jenkins-bot: Add preference for offering new topic tool when creating new talk pages [extensions/DiscussionTools] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/803527 (https://phabricator.wikimedia.org/T297990) (owner: 10Bartosz Dziewoński) [20:20:13] (03Merged) 10jenkins-bot: Show createpage preference only when feature is available [extensions/DiscussionTools] (wmf/1.39.0-wmf.14) - 10https://gerrit.wikimedia.org/r/803528 (https://phabricator.wikimedia.org/T310053) (owner: 10Bartosz Dziewoński) [20:22:26] here we go :) [20:22:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:46] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1002 is CRITICAL: CRITICAL - elasticsearch http://localhost:9400/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9400): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7fc8d48cd630: Failed to establish a new connection: [Errno 111] Connection refu [20:23:46] ttps://wikitech.wikimedia.org/wiki/Search%23Administration [20:24:32] MatmaRex: all pulled to mwdebug1001. can you have a look please? [20:24:35] or do i need to pull config too? 
[20:24:51] no, looking [20:25:11] thanks [20:25:25] urbanecm: wmf.15 looks good, i'm testing the others [20:26:04] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1002 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: number_of_nodes: 6, number_of_data_nodes: 6, unassigned_shards: 0, cluster_name: cloudelastic-omega-eqiad, active_primary_shards: 768, number_of_in_flight_fetch: 0, timed_out: False, relocating_shards: 2, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0, active_shards: 1538, ini [20:26:04] g_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, status: green https://wikitech.wikimedia.org/wiki/Search%23Administration [20:26:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:26:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:26:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:15] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:27:24] urbanecm: on wmf.14, the new l10n messages are missing [20:27:38] MatmaRex: at this stage, that's expected. they will appear after the sync-world only. [20:27:47] right. everything else looks good [20:27:58] great! in that case, pressing the buttons [20:28:39] !log urbanecm@deploy1002 Started scap: DiscussionTools backports + r803526 (T310053, T297990) [20:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:44] T310053: DiscussionTools throws an exception in preferences. - https://phabricator.wikimedia.org/T310053 [20:28:44] T297990: Clarify how the "Enable quick topic adding" setting affects creating new talk pages (separate setting?) 
- https://phabricator.wikimedia.org/T297990 [20:28:52] this will take a while. [20:28:55] thanks [20:28:59] np [20:32:32] PROBLEM - mediawiki originals uploads -hourly- for eqiad on alert1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1009 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To%23mediawiki_originals_uploads https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [20:40:53] (03PS1) 10JHathaway: exim: update comment on BDAT issue [puppet] - 10https://gerrit.wikimedia.org/r/803601 [20:43:10] MatmaRex: i just noticed comment for wgDiscussionTools_newtopictool in https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/801820 says "Keep in sync with wgDiscussionTools_sourcemodetoolbar below". looks like that's no longer the case. is that intended/expected? and if it is, should the comment be removed? [20:44:50] 10SRE-OnFire, 10SRE Observability (FY2021/2022-Q4): Set up POC Dispatch environment and evaluate its viability - https://phabricator.wikimedia.org/T309033 (10herron) New release out today which appears to address some of the issues outlined above https://github.com/Netflix/dispatch/releases/tag/v20220607 Also... [20:45:57] urbanecm: it's still the case, and they are in sync. it just doesn't look obvious because of all the random per-wiki overrides [20:46:26] ah, ok. i likely misunderstood the comment then (i thought it means they need to be the exact same value) [20:46:47] urbanecm: if both 'replytool' and 'newtopictool' are set to 'default' (which means "beta feature"), then 'sourcemodetoolbar' needs to also be set to 'default' [20:46:59] if either of them is 'available', then it should also be 'available' [20:47:06] got it [20:47:17] thanks for clarifying that! 
[20:47:22] this is basically a workaround for a bug in our code, sorry that it sucks [20:48:10] i was hoping that the rollout would be faster, and it seemed not worth fixing ;) [20:48:18] i was just slightly worried there's some oversight, but i just imagined the "sync" incorrectly -- sorry for that :). [20:48:29] it's not an issue from my side [20:48:30] yeah, it's not phrased well [20:52:02] (03PS4) 10Urbanecm: Make new topic tool available as opt-out almost everywhere (phase 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801820 (https://phabricator.wikimedia.org/T309368) (owner: 10Bartosz Dziewoński) [20:52:08] (03CR) 10Urbanecm: [C: 03+2] Make new topic tool available as opt-out almost everywhere (phase 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801820 (https://phabricator.wikimedia.org/T309368) (owner: 10Bartosz Dziewoński) [20:52:24] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:23] !log urbanecm@deploy1002 Finished scap: DiscussionTools backports + r803526 (T310053, T297990) (duration: 24m 43s) [20:53:23] (03Merged) 10jenkins-bot: Make new topic tool available as opt-out almost everywhere (phase 3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/801820 (https://phabricator.wikimedia.org/T309368) (owner: 10Bartosz Dziewoński) [20:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:27] T310053: DiscussionTools throws an exception in preferences. - https://phabricator.wikimedia.org/T310053 [20:53:28] T297990: Clarify how the "Enable quick topic adding" setting affects creating new talk pages (separate setting?) - https://phabricator.wikimedia.org/T297990 [20:53:30] and, just in time :) [20:53:45] MatmaRex: your config patch is at mwdebug1001. can you check? 
[20:54:00] looking [20:56:34] urbanecm: looks good [20:56:40] thanks, syncing [20:58:39] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:58:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:41] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: afc847a6865be01dff94653feae0d4c9fc9952f6: Make new topic tool available as opt-out almost everywhere; phase 3 (T309368) (duration: 03m 37s) [21:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:44] MatmaRex: it's live. i guess we're done now? :) [21:00:44] T309368: [Config Change] Enable New Topic Tool as opt-out at Phase 3 wikis (desktop) - https://phabricator.wikimedia.org/T309368 [21:01:16] urbanecm: yes. thank you! [21:01:20] happy to help! 
[21:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:04:44] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1003-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:09:48] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:09:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:16:34] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:06] (03PS2) 10JHathaway: exim: update comment on BDAT issue [puppet] - 10https://gerrit.wikimedia.org/r/803601 (https://phabricator.wikimedia.org/T307873) [21:17:24] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/803601 (https://phabricator.wikimedia.org/T307873) (owner: 10JHathaway) [21:17:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:28:51] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host 
mc2020.codfw.wmnet [21:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:47] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2020.codfw.wmnet [21:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:49] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2021.codfw.wmnet [21:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:53] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2021.codfw.wmnet [21:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:54] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2022.codfw.wmnet [21:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:51] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2022.codfw.wmnet [21:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:53] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2023.codfw.wmnet [22:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:38] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2023.codfw.wmnet [22:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:39] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2024.codfw.wmnet [22:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:01] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2024.codfw.wmnet [22:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:03] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2025.codfw.wmnet 
[22:27:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:09] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2025.codfw.wmnet [22:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:38:11] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2026.codfw.wmnet [22:38:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:44:09] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2026.codfw.wmnet [22:44:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:10] !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc2027.codfw.wmnet [22:49:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:56] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2027.codfw.wmnet [22:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log