[00:10:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P31683 and previous config saved to /var/cache/conftool/dbconfig/20220722-001056-ladsgroup.json
[00:14:58] SRE, DNS, Traffic, WMF-Legal, and 4 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (Dzahn) In progress→Resolved a:Dzahn curl https://policy.wikimedia.org .. The document has moved
!log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2021.codfw.wmnet with OS bullseye
[05:55:10] jmm@cumin2002: Failed to log message to wiki. Somebody should check the error logs.
[05:55:13] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2021.codfw.wmnet with OS bullseye completed: - ganeti2021 (**PASS**) - Downtimed on...
[05:57:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2014.codfw.wmnet with OS bullseye
[05:57:34] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti2014.codfw.wmnet with OS bullseye
[06:13:48] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2014.codfw.wmnet with reason: host reimage
[06:13:49] jmm@cumin2002: Failed to log message to wiki. Somebody should check the error logs.
[06:13:58] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[06:16:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2014.codfw.wmnet with reason: host reimage
[06:21:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2021.codfw.wmnet
[06:21:45] jmm@cumin2002: Failed to log message to wiki. Somebody should check the error logs.
[06:24:52] (PS1) Muehlenhoff: facilities: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/816100 (https://phabricator.wikimedia.org/T308013)
[06:26:47] (PS1) Muehlenhoff: iegreview: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/816101 (https://phabricator.wikimedia.org/T308013)
[06:27:37] (CR) CI reject: [V: -1] iegreview: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/816101 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[06:29:04] (PS2) Muehlenhoff: iegreview: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/816101 (https://phabricator.wikimedia.org/T308013)
[06:30:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2021.codfw.wmnet
[06:30:36] jmm@cumin2002: Failed to log message to wiki. Somebody should check the error logs.
[06:31:33] (PS1) Muehlenhoff: librenms: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/816102 (https://phabricator.wikimedia.org/T308013)
[06:31:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2014.codfw.wmnet with OS bullseye
[06:32:00] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti2014.codfw.wmnet with OS bullseye completed: - ganeti2014 (**PASS**) - Downtimed on...
[06:35:53] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[06:38:26] (PS1) Muehlenhoff: pmacct: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/816104 (https://phabricator.wikimedia.org/T308013)
[06:50:57] (PS1) Muehlenhoff: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/816105 (https://phabricator.wikimedia.org/T308013)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220722T0700)
[07:03:38] (CR) Ayounsi: "Looking at https://atlas.ripe.net/results/maps/density/ a lot of African countries have 0 or very few probes." [dns] - https://gerrit.wikimedia.org/r/816028 (https://phabricator.wikimedia.org/T311472) (owner: BCornwall)
[07:05:18] SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Danielgblack) Looking into how it might be possible to grab stack traces for the next stall if it occurs. ` sudo -u mysql gcore $(...
[07:19:38] SRE, Data-Engineering, Discovery: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (elukey) Previous occurrence: https://phabricator.wikimedia.org/T304224 @Tarrow Hi! We do have some backups about archiva's releases but in general we'd probab...
[07:42:33] SRE-OnFire, DBA, Sustainability (Incident Followup): Investigate mariadb 10.6 performance regression during spikes/high load - https://phabricator.wikimedia.org/T311106 (Danielgblack) @Ladsgroup thanks for the perf recordings. I don't know of a good interactive measure. I'm sorry to say I couldn't se...
[07:48:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[07:48:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[07:48:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T312863)', diff saved to https://phabricator.wikimedia.org/P31696 and previous config saved to /var/cache/conftool/dbconfig/20220722-074844-ladsgroup.json
[07:48:48] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[07:51:07] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:01:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T312863)', diff saved to https://phabricator.wikimedia.org/P31697 and previous config saved to /var/cache/conftool/dbconfig/20220722-080112-ladsgroup.json
[08:01:16] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[08:08:49] (CR) Filippo Giunchedi: netmon: Add suppport for multiple backup/passive nodes in Puppet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/814848 (https://phabricator.wikimedia.org/T309074) (owner: Andrea Denisse)
[08:16:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P31698 and previous config saved to /var/cache/conftool/dbconfig/20220722-081617-ladsgroup.json
[08:16:19] ladsgroup@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[08:31:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P31699 and previous config saved to /var/cache/conftool/dbconfig/20220722-083122-ladsgroup.json
[08:31:24] ladsgroup@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[08:40:26] (PS3) Alexandros Kosiaris: services_proxy: Move AF common stanza to separate template [puppet] - https://gerrit.wikimedia.org/r/815957
[08:40:28] (PS3) Alexandros Kosiaris: services_proxy: Allow having both v4 and v6 AF enabled [puppet] - https://gerrit.wikimedia.org/r/815958
[08:40:30] (PS3) Alexandros Kosiaris: services_proxy: Listen on :: and not ::1 [puppet] - https://gerrit.wikimedia.org/r/815959 (https://phabricator.wikimedia.org/T255568)
[08:42:37] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36350/console" [puppet] - https://gerrit.wikimedia.org/r/815957 (owner: Alexandros Kosiaris)
[08:43:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2121.codfw.wmnet with reason: Maintenance
[08:43:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2121.codfw.wmnet with reason: Maintenance
[08:43:48] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 12 hosts with reason: Maintenance
[08:44:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 12 hosts with reason: Maintenance
[08:45:42] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/816105 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:46:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T312863)', diff saved to https://phabricator.wikimedia.org/P31700 and previous config saved to /var/cache/conftool/dbconfig/20220722-084627-ladsgroup.json
[08:46:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[08:46:33] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[08:46:34] ladsgroup@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[08:46:38] (CR) Jbond: [C: +1] facilities: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/816100 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:46:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1144.eqiad.wmnet with reason: Maintenance
[08:46:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T312863)', diff saved to https://phabricator.wikimedia.org/P31701 and previous config saved to /var/cache/conftool/dbconfig/20220722-084647-ladsgroup.json
[08:47:06] (CR) Jbond: [C: +1] librenms: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/816102 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:48:17] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/816104 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[08:50:57] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36351/console" [puppet] - https://gerrit.wikimedia.org/r/815958 (owner: Alexandros Kosiaris)
[08:57:15] (PS4) Alexandros Kosiaris: services_proxy: Allow having both v4 and v6 AF enabled [puppet] - https://gerrit.wikimedia.org/r/815958
[08:57:17] (PS4) Alexandros Kosiaris: services_proxy: Listen on :: and not ::1 [puppet] - https://gerrit.wikimedia.org/r/815959 (https://phabricator.wikimedia.org/T255568)
[09:18:37] (PS5) Alexandros Kosiaris: services_proxy: Allow having both v4 and v6 AF enabled [puppet] - https://gerrit.wikimedia.org/r/815958
[09:18:39] (PS5) Alexandros Kosiaris: services_proxy: Listen on :: and not ::1 [puppet] - https://gerrit.wikimedia.org/r/815959 (https://phabricator.wikimedia.org/T255568)
[09:23:15] (PS1) Hashar: POST events asynchronously [software/gerrit/plugins/events-wikimedia] - https://gerrit.wikimedia.org/r/816115
[09:23:46] SRE, Deployments, bacula, Parsoid (Tracking), Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (jcrespo) @elukey We didn't receive any bad reports so far, should we be good to close this task or...
[09:24:07] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS (NOOP 20 DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36352/console" [puppet] - https://gerrit.wikimedia.org/r/815958 (owner: Alexandros Kosiaris)
[09:29:03] (CR) Alexandros Kosiaris: [V: +1] "PCC SUCCESS (NOOP 20 DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36353/console" [puppet] - https://gerrit.wikimedia.org/r/815959 (https://phabricator.wikimedia.org/T255568) (owner: Alexandros Kosiaris)
[09:32:08] does anyone happen to know if WikimediaMaintenance/renameWiki.php has been used recently? (perhaps Amir1 or urbanecm?)
[09:32:25] because I think I’ve found a bug in it that seems to date back to 2013 lol
[09:32:48] Lucas_WMDE: it's not been used since be-tarask rename
[09:32:52] (PS2) Ayounsi: Netbox-next: Allow login from NDA users [puppet] - https://gerrit.wikimedia.org/r/815908 (https://phabricator.wikimedia.org/T302870)
[09:32:53] SRE, Deployments, bacula, Parsoid (Tracking), Release-Engineering-Team (Doing): Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (RhinosF1) T309162 is still actionable from the incident.
[09:33:02] and do you know if it worked at the time?
[09:33:18] well, I guess the rename worked, eventually
[09:33:19] no clue
[09:33:27] yup
[09:33:28] I’ll just upload a Gerrit patch showing what I mean with the bug
[09:33:31] thanks
[09:34:18] (CR) Ayounsi: [C: +2] Netbox-next: Allow login from NDA users (1 comment) [puppet] - https://gerrit.wikimedia.org/r/815908 (https://phabricator.wikimedia.org/T302870) (owner: Ayounsi)
[09:34:20] (CR) Muehlenhoff: [C: +2] librenms: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/816102 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[09:34:26] (CR) Ayounsi: [C: +2] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/815908 (https://phabricator.wikimedia.org/T302870) (owner: Ayounsi)
[09:34:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[09:34:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[09:34:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T312863)', diff saved to https://phabricator.wikimedia.org/P31702 and previous config saved to /var/cache/conftool/dbconfig/20220722-093453-ladsgroup.json
[09:34:58] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[09:35:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[09:35:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[09:35:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 12 hosts with reason: Maintenance
[09:35:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 12 hosts with reason: Maintenance
[09:36:23] (CR) Muehlenhoff: [C: +2] pmacct: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/816104 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[09:37:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[09:37:43] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[09:37:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[09:37:49] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[09:37:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T312984)', diff saved to https://phabricator.wikimedia.org/P31704 and previous config saved to /var/cache/conftool/dbconfig/20220722-093754-ladsgroup.json
[09:37:58] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984
[09:39:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 10%: Maint finished', diff saved to https://phabricator.wikimedia.org/P31705 and previous config saved to /var/cache/conftool/dbconfig/20220722-093940-ladsgroup.json
[09:42:14] (PS1) Ayounsi: IDP: allow NDA for netbox and netbox-next [puppet] - https://gerrit.wikimedia.org/r/816118 (https://phabricator.wikimedia.org/T302870)
[09:43:21] Amir1: here it is if you’re interested https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaMaintenance/+/816119
[09:43:59] (PS1) Jcrespo: bacula: Setup backup[12]00[89] as new production and database backup hosts [puppet] - https://gerrit.wikimedia.org/r/816120 (https://phabricator.wikimedia.org/T313582)
[09:45:37] (PS2) Jcrespo: bacula: Setup backup[12]00[89] as new production and database backup hosts [puppet] - https://gerrit.wikimedia.org/r/816120 (https://phabricator.wikimedia.org/T313582)
[09:48:29] (CR) CI reject: [V: -1] bacula: Setup backup[12]00[89] as new production and database backup hosts [puppet] - https://gerrit.wikimedia.org/r/816120 (https://phabricator.wikimedia.org/T313582) (owner: Jcrespo)
[09:48:59] (CR) Muehlenhoff: [C: +2] facilities: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/816100 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[09:49:30] SRE, SRE-Access-Requests: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (Vgutierrez) p:Triage→Medium a:Vgutierrez @MRaishWMF you need to use a dedicated SSH key for production access. Please generate it and update this ticket
[09:50:30] Lucas_WMDE: that's weird, I'm sure we don't rename anything on the database level
[09:50:40] (PS3) Muehlenhoff: alertmanager: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/812175 (https://phabricator.wikimedia.org/T308013)
[09:50:40] probably that was some aspirations
[09:51:28] (PS3) Jcrespo: bacula: Setup backup[12]00[89] as new production and database backup hosts [puppet] - https://gerrit.wikimedia.org/r/816120 (https://phabricator.wikimedia.org/T313582)
[09:51:41] SRE, SRE-Access-Requests: Requesting access to GLOBAL ROOT for Francesco Negri - https://phabricator.wikimedia.org/T313504 (Vgutierrez) p:Triage→Medium a:Vgutierrez
[09:51:56] probably, yeah
[09:53:44] is it still be_x_oldwiki at the db level?
[09:54:40] Lucas_WMDE: yup
[09:54:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 25%: Maint finished', diff saved to https://phabricator.wikimedia.org/P31706 and previous config saved to /var/cache/conftool/dbconfig/20220722-095444-ladsgroup.json
[09:54:45] ok
[09:54:59] do you know about the renaming wikis blocker?
[09:55:02] so it’s maybe a good thing that that block didn’t do anything
[09:55:06] https://phabricator.wikimedia.org/T172035
[09:55:08] but maybe I should adjust the comment
[09:55:09] yeah :/
[09:55:24] (CR) Muehlenhoff: [C: +2] alertmanager: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/812175 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[09:55:40] if the wikidata part gets fixed, we can probably try again ;)
[09:59:26] (PS1) Vgutierrez: admin: Add fnegri user [puppet] - https://gerrit.wikimedia.org/r/816124 (https://phabricator.wikimedia.org/T313504)
[09:59:28] (PS1) Vgutierrez: admin: Add fnegri to ops group [puppet] - https://gerrit.wikimedia.org/r/816125 (https://phabricator.wikimedia.org/T313504)
[10:02:17] SRE, SRE-Access-Requests, Patch-For-Review, Release-Engineering-Team (The Decommission Mission 💀): Add dancy to phabricator-roots - https://phabricator.wikimedia.org/T313551 (Vgutierrez) p:Triage→Medium
[10:06:18] !log push pfw policies - T313522
[10:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 75%: Maint finished', diff saved to https://phabricator.wikimedia.org/P31707 and previous config saved to /var/cache/conftool/dbconfig/20220722-100948-ladsgroup.json
[10:10:47] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2021.codfw.wmnet to cluster codfw and group B
[10:10:53] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2021.codfw.wmnet to cluster codfw and group B
[10:11:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2021.codfw.wmnet
[10:18:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2021.codfw.wmnet
[10:18:27] jmm@cumin2002: Failed to log message to wiki. Somebody should check the error logs.
[10:21:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2021.codfw.wmnet to cluster codfw and group B
[10:22:53] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2021.codfw.wmnet to cluster codfw and group B
[10:22:55] (PS1) Kevin Bazira: ml-services: Add cswiki & enwiki articletopic isvcs to staging [deployment-charts] - https://gerrit.wikimedia.org/r/816127 (https://phabricator.wikimedia.org/T313307)
[10:24:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1158 (re)pooling @ 100%: Maint finished', diff saved to https://phabricator.wikimedia.org/P31708 and previous config saved to /var/cache/conftool/dbconfig/20220722-102452-ladsgroup.json
[10:30:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2014.codfw.wmnet
[10:39:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2014.codfw.wmnet
[10:49:31] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/816118 (https://phabricator.wikimedia.org/T302870) (owner: Ayounsi)
[10:50:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti2014.codfw.wmnet to cluster codfw and group C
[10:50:50] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/816125 (https://phabricator.wikimedia.org/T313504) (owner: Vgutierrez)
[10:50:54] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2014.codfw.wmnet to cluster codfw and group C
[10:51:25] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/816101 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[10:59:16] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/815957 (owner: Alexandros Kosiaris)
[10:59:23] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/815958 (owner: Alexandros Kosiaris)
[10:59:27] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (MoritzMuehlenhoff)
[10:59:31] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/815959 (https://phabricator.wikimedia.org/T255568) (owner: Alexandros Kosiaris)
[11:01:28] (CR) Muehlenhoff: [C: +2] iegreview: Add SPDX headers [puppet] - https://gerrit.wikimedia.org/r/816101 (https://phabricator.wikimedia.org/T308013) (owner: Muehlenhoff)
[11:06:53] (CR) Ayounsi: [C: +2] IDP: allow NDA for netbox and netbox-next [puppet] - https://gerrit.wikimedia.org/r/816118 (https://phabricator.wikimedia.org/T302870) (owner: Ayounsi)
[11:13:14] (CR) Jbond: beaker: add initial beaker files (3 comments) [puppet] - https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) (owner: Jbond)
[11:13:50] (PS32) Jbond: beaker: add initial beaker files [puppet] - https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635)
[11:14:16] (PS33) Jbond: beaker: add initial beaker files [puppet] - https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635)
[11:16:57] (PS1) Majavah: netbox: fix multiple groups in cas config [puppet] - https://gerrit.wikimedia.org/r/816129
[11:17:54] (CR) Majavah: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36355/console" [puppet] - https://gerrit.wikimedia.org/r/816129 (owner: Majavah)
[11:20:11] (CR) Ayounsi: [C: +1] netbox: fix multiple groups in cas config [puppet] - https://gerrit.wikimedia.org/r/816129 (owner: Majavah)
[11:33:50] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/816129 (owner: Majavah)
[11:34:08] (CR) Ayounsi: [C: +2] netbox: fix multiple groups in cas config [puppet] - https://gerrit.wikimedia.org/r/816129 (owner: Majavah)
[11:36:35] (PS1) Jelto: gitlab_runner: add workaround for DNS issues in WMCS, fix images [puppet] - https://gerrit.wikimedia.org/r/816133 (https://phabricator.wikimedia.org/T311241)
[11:39:11] (CR) Ssingh: [C: +1] "Looks good basing it from the latency measurements in the commit message." [dns] - https://gerrit.wikimedia.org/r/816053 (owner: BCornwall)
[11:44:41] (CR) Jelto: [C: +2] gitlab_runner: add workaround for DNS issues in WMCS, fix images [puppet] - https://gerrit.wikimedia.org/r/816133 (https://phabricator.wikimedia.org/T311241) (owner: Jelto)
[11:47:40] SRE, Infrastructure-Foundations, netbox, Patch-For-Review: Grant cn=nda some sort of read only access to Netbox - https://phabricator.wikimedia.org/T302870 (ayounsi) After a few back and forth I granted the following permissions to the `nda` group: `lines=20 circuits | circuit termination circuit...
[11:55:04] (PS1) Filippo Giunchedi: prometheus: update blackbox check alerts runbook link [puppet] - https://gerrit.wikimedia.org/r/816135 (https://phabricator.wikimedia.org/T312947)
[11:56:28] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:59:15] (PS1) Filippo Giunchedi: sre: link to service-specific Runbook wikitech page [alerts] - https://gerrit.wikimedia.org/r/816136 (https://phabricator.wikimedia.org/T312947)
[11:59:51] SRE, Ganeti, Infrastructure-Foundations: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686 (MoritzMuehlenhoff)
[12:11:29] (PS1) Jaime Nuche: scap: allow `scap` user to login into deployment-prep hosts [puppet] - https://gerrit.wikimedia.org/r/816140
[12:22:33] SRE, Data-Engineering, Discovery: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (BTullis) I think I'd probably consider growing the disk in ganeti, which should be able to increase the headroom for us. https://wikitech.wikimedia.org/wiki/G...
[12:31:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T312863)', diff saved to https://phabricator.wikimedia.org/P31709 and previous config saved to /var/cache/conftool/dbconfig/20220722-123135-ladsgroup.json
[12:31:38] ladsgroup@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[12:31:39] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[12:43:18] SRE, Infrastructure-Foundations: New Python base layer to manage users/groups in LDAP - https://phabricator.wikimedia.org/T313595 (MoritzMuehlenhoff)
[12:43:25] SRE, Infrastructure-Foundations: New Python base layer to manage users/groups in LDAP - https://phabricator.wikimedia.org/T313595 (MoritzMuehlenhoff) p:Triage→Medium
[12:43:44] PROBLEM - Check for snapshots leaked by cinder backup agent on cloudcontrol1005 is CRITICAL: 12 snaps in the admin project https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/Check_for_snapshots_leaked_by_cinder_backup_agent
[12:44:14] SRE, Infrastructure-Foundations, LDAP: New Python base layer to manage users/groups in LDAP - https://phabricator.wikimedia.org/T313595 (taavi)
[12:46:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P31710 and previous config saved to /var/cache/conftool/dbconfig/20220722-124640-ladsgroup.json
[12:55:30] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135
[12:55:34] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135
[12:56:37] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bullseye
[12:56:44] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye
[12:57:34] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1006.eqiad.wmnet with reason: host reimage
[13:01:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P31711 and previous config saved to /var/cache/conftool/dbconfig/20220722-130145-ladsgroup.json
[13:02:55] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1006.eqiad.wmnet with reason: host reimage
[13:06:02] SRE, ops-eqiad, DC-Ops, fundraising-tech-ops: Q3:(Need By: TBD) rack/setup/install frlog1002 - https://phabricator.wikimedia.org/T306839 (Cmjohnson) Open→Resolved @Jgreen updated the vlan to administration. [edit interfaces interface-range vlan-administration] member ge-0/0/33 { .....
[13:09:15] SRE, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (ayounsi) Resolved→Open From diffscan: ` STATUS HOST PORT PROTO OPREV CPREV DNS OPEN 208.80.154.150 7443 tcp 0 6 c...
[13:11:50] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2033.codfw.wmnet with OS bullseye
[13:11:51] bking@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[13:11:56] SRE, Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2033.codfw.wmnet with OS bullseye
[13:14:54] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/816124 (https://phabricator.wikimedia.org/T313504) (owner: Vgutierrez)
[13:15:09] SRE, Infrastructure-Foundations, LDAP: New Python base layer to manage users/groups in LDAP - https://phabricator.wikimedia.org/T313595 (Peachey88)
[13:15:13] (CR) Vgutierrez: [C: +2] admin: Add fnegri user [puppet] - https://gerrit.wikimedia.org/r/816124 (https://phabricator.wikimedia.org/T313504) (owner: Vgutierrez)
[13:16:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T312863)', diff saved to https://phabricator.wikimedia.org/P31712 and previous config saved to /var/cache/conftool/dbconfig/20220722-131650-ladsgroup.json
[13:16:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance
[13:16:52] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet1006.eqiad.wmnet with OS bullseye
[13:16:55] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863
[13:17:01] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudnet1006.eqiad.wmnet with OS bullseye comple...
[13:17:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1123.eqiad.wmnet with reason: Maintenance [13:17:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T312863)', diff saved to https://phabricator.wikimedia.org/P31713 and previous config saved to /var/cache/conftool/dbconfig/20220722-131710-ladsgroup.json [13:17:26] 10SRE, 10Data-Engineering, 10Discovery: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (10elukey) @BTullis another way could be to add a new disk of say 200G, format it and then mount `/var/lib/archiva` on it. [13:18:21] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) [13:19:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Cmjohnson) 05Open→03Resolved Thanks @papaul that worked. @Andrew all yours! 
[13:20:42] (03CR) 10Klausman: [C: 03+2] ml-services: Add cswiki & enwiki articletopic isvcs to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/816127 (https://phabricator.wikimedia.org/T313307) (owner: 10Kevin Bazira) [13:23:29] (03PS34) 10Jbond: beaker: add initial beaker files [puppet] - 10https://gerrit.wikimedia.org/r/809224 (https://phabricator.wikimedia.org/T253635) [13:25:18] (03Merged) 10jenkins-bot: ml-services: Add cswiki & enwiki articletopic isvcs to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/816127 (https://phabricator.wikimedia.org/T313307) (owner: 10Kevin Bazira) [13:26:14] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2033.codfw.wmnet with reason: host reimage [13:26:36] (03CR) 10Giuseppe Lavagetto: [C: 03+2] New package version 0.0.3-1 [software/python-poolcounter] - 10https://gerrit.wikimedia.org/r/815749 (owner: 10Giuseppe Lavagetto) [13:28:51] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2033.codfw.wmnet with reason: host reimage [13:30:16] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [13:30:18] ladsgroup@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[13:30:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [13:32:11] (03PS1) 10Jbond: C:aptrepo: Add component for puppet 7 packages [puppet] - 10https://gerrit.wikimedia.org/r/816168 (https://phabricator.wikimedia.org/T313387) [13:44:53] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/816168 (https://phabricator.wikimedia.org/T313387) (owner: 10Jbond) [13:45:19] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2033.codfw.wmnet with OS bullseye [13:45:25] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2033.codfw.wmnet with OS bullseye completed: - elastic2033 (**PAS... [13:46:29] 10SRE, 10Data-Engineering, 10Discovery: archiva1002 is running low on space left in the root partition - https://phabricator.wikimedia.org/T313386 (10Gehel) Should we also think about releasing more projects to Maven Central and using Archiva mostly as a local cache? This would externalise the disk space issue. 
[13:48:05] (03PS1) 10Majavah: hieradata: close down cloudweb envoy port [puppet] - 10https://gerrit.wikimedia.org/r/816171 (https://phabricator.wikimedia.org/T305414) [13:48:31] (03PS1) 10Hashar: build: manage dependencies with rules_jvm_external [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/816172 [13:48:46] (03PS2) 10Hashar: build: manage dependencies with rules_jvm_external [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/816172 [13:49:40] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36356/console" [puppet] - 10https://gerrit.wikimedia.org/r/816171 (https://phabricator.wikimedia.org/T305414) (owner: 10Majavah) [13:59:17] (03PS1) 10Zabe: wikimedia.org: Add developers.wikimedia.org alias [dns] - 10https://gerrit.wikimedia.org/r/816174 (https://phabricator.wikimedia.org/T313597) [14:01:07] (03CR) 10Samtar: [C: 03+1] "LGTM! Also, TIL that we can use `docs.wikimedia.org` — nice!" 
[dns] - 10https://gerrit.wikimedia.org/r/816174 (https://phabricator.wikimedia.org/T313597) (owner: 10Zabe) [14:02:21] (03CR) 10Samtar: wikimedia.org: Add developers.wikimedia.org alias (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/816174 (https://phabricator.wikimedia.org/T313597) (owner: 10Zabe) [14:02:52] 🤦‍♀️ [14:04:11] (03CR) 10Jbond: [C: 03+2] C:aptrepo: Add component for puppet 7 packages [puppet] - 10https://gerrit.wikimedia.org/r/816168 (https://phabricator.wikimedia.org/T313387) (owner: 10Jbond) [14:04:13] (03PS1) 10Zabe: mediawiki: Redirect developers.wm.o to developer.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/816175 (https://phabricator.wikimedia.org/T313597) [14:06:04] (03PS2) 10Zabe: mediawiki: Redirect developers.wm.o to developer.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/816175 (https://phabricator.wikimedia.org/T313597) [14:07:40] (03PS3) 10Hashar: build: manage dependencies with rules_jvm_external [software/gerrit/plugins/events-wikimedia] - 10https://gerrit.wikimedia.org/r/816172 [14:10:42] (03CR) 10Zabe: "Thanks for catching this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/802565 (https://phabricator.wikimedia.org/T308013) (owner: 10Zabe) [14:17:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T312863)', diff saved to https://phabricator.wikimedia.org/P31717 and previous config saved to /var/cache/conftool/dbconfig/20220722-141724-ladsgroup.json [14:17:29] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [14:17:42] (03PS1) 10Jbond: C:package_builder: add gem2deb required to build puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/816179 (https://phabricator.wikimedia.org/T313387) [14:21:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T312863)', diff saved to https://phabricator.wikimedia.org/P31718 and previous config saved to /var/cache/conftool/dbconfig/20220722-142150-ladsgroup.json [14:22:20] (03PS1) 10Muehlenhoff: Install gem2deb on package builder hosts [puppet] - 10https://gerrit.wikimedia.org/r/816180 [14:22:53] (03CR) 10Vgutierrez: [C: 03+2] admin: Add fnegri to ops group [puppet] - 10https://gerrit.wikimedia.org/r/816125 (https://phabricator.wikimedia.org/T313504) (owner: 10Vgutierrez) [14:23:11] (03PS2) 10Vgutierrez: admin: Add fnegri to ops group [puppet] - 10https://gerrit.wikimedia.org/r/816125 (https://phabricator.wikimedia.org/T313504) [14:23:47] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/816179 (https://phabricator.wikimedia.org/T313387) (owner: 10Jbond) [14:25:27] (03CR) 10Jbond: [C: 03+2] C:package_builder: add gem2deb required to build puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/816179 (https://phabricator.wikimedia.org/T313387) (owner: 10Jbond) [14:25:55] jbond: ok to merge f038dd4a61? 
[14:25:59] (03CR) 10Ladsgroup: [C: 04-1] P:mariadb::grants: add cloudweb1003/1004 grants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816026 (https://phabricator.wikimedia.org/T305414) (owner: 10Majavah) [14:26:01] yes please [14:26:20] done [14:26:29] thx [14:26:50] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to GLOBAL ROOT for Francesco Negri - https://phabricator.wikimedia.org/T313504 (10Vgutierrez) 05Open→03Resolved [14:28:10] (03Abandoned) 10Muehlenhoff: Install gem2deb on package builder hosts [puppet] - 10https://gerrit.wikimedia.org/r/816180 (owner: 10Muehlenhoff) [14:29:49] (03PS3) 10Majavah: P:mariadb::grants: add cloudweb1003/1004 grants [puppet] - 10https://gerrit.wikimedia.org/r/816026 (https://phabricator.wikimedia.org/T305414) [14:29:57] !log restarting tomcat on idp-test.w.o [14:29:59] (03CR) 10Majavah: P:mariadb::grants: add cloudweb1003/1004 grants (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816026 (https://phabricator.wikimedia.org/T305414) (owner: 10Majavah) [14:30:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:09] (03PS4) 10Ladsgroup: P:mariadb::grants: add cloudweb1003/1004 grants [puppet] - 10https://gerrit.wikimedia.org/r/816026 (https://phabricator.wikimedia.org/T305414) (owner: 10Majavah) [14:32:15] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] P:mariadb::grants: add cloudweb1003/1004 grants [puppet] - 10https://gerrit.wikimedia.org/r/816026 (https://phabricator.wikimedia.org/T305414) (owner: 10Majavah) [14:32:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P31719 and previous config saved to /var/cache/conftool/dbconfig/20220722-143229-ladsgroup.json [14:34:06] 10SRE, 10SRE-Access-Requests: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (10MRaishWMF) [14:34:42] 10SRE, 10SRE-Access-Requests: Requesting access to 
private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (10MRaishWMF) @Vgutierrez thanks, I updated with a new SSH key. Let me know if this is adequate, and thank you [14:36:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P31720 and previous config saved to /var/cache/conftool/dbconfig/20220722-143655-ladsgroup.json [14:39:29] PROBLEM - Zookeeper Server #page on conf1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [14:39:48] * Emperor here [14:39:56] conf1009 [14:40:04] checking the state of the server [14:40:05] PROBLEM - Zookeeper Server #page on conf1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [14:40:20] that's conf1008 going as well [14:40:23] oh, is it the entire service? 
[14:40:38] https://phabricator.wikimedia.org/T311407 [14:40:44] 10SRE, 10SRE-Access-Requests: Requesting access to private-data for Mikeraish (MRaishWMF) - https://phabricator.wikimedia.org/T313429 (10Vgutierrez) All good :) waiting for @GEscalante-WMF approval and from @Ottomata || @odimitrijevic [14:40:49] I think this is the nth time that the downtime has expired [14:40:54] ah, ok, [14:41:03] sorry, wasn't around last week [14:41:04] PROBLEM - Check systemd state on elastic2046 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:05] PROBLEM - Zookeeper Server #page on conf1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [14:41:25] !log bking@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2046.codfw.wmnet with OS bullseye [14:41:31] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host elastic2046.codfw.wmnet with OS bullseye [14:41:31] but maybe I can send a patch to prevent that in the future? [14:43:54] cdanis: DYK how long would be a sensible extension for the downtime? 
[14:44:20] Emperor: I have a better idea [14:44:22] I would ask akosiaris [14:44:34] (03PS1) 10Jcrespo: zookeeper: Disable notification on conf1007,conf1008,conf1009 [puppet] - 10https://gerrit.wikimedia.org/r/816181 (https://phabricator.wikimedia.org/T311407) [14:44:42] ^ [14:44:58] (03PS2) 10Jcrespo: zookeeper: Disable notifications on conf1007,conf1008,conf1009 [puppet] - 10https://gerrit.wikimedia.org/r/816181 (https://phabricator.wikimedia.org/T311407) [14:45:11] I'll ack the VO [14:45:13] <_joe_> I just got paged [14:45:18] it is a bit of a brute force [14:45:25] but it should work indefinitely [14:45:30] (03PS4) 10Jaime Nuche: scap: allow `scap` user to login into deployment-prep scap targets [puppet] - 10https://gerrit.wikimedia.org/r/816140 (https://phabricator.wikimedia.org/T303559) [14:45:56] False alarm [14:46:02] oh [14:46:04] Expired downtime [14:46:06] will wait for akosiaris to think if it is a good solution [14:46:07] akosiaris: thanks [14:46:08] <_joe_> ok good news: escalation to batphone works [14:46:13] :) [14:46:20] akosiaris: https://gerrit.wikimedia.org/r/c/operations/puppet/+/816181 [14:46:29] I am assuming those are new hosts, WIP [14:46:37] _joe_: shouldn't it have gone to the working-hours oncall people first? [14:46:38] feel free to solve in another way :-D [14:46:45] <_joe_> Emperor: I think it did [14:46:49] Emperor: i think it did [14:46:52] <_joe_> but wasn't acked within 5 minutes [14:46:56] And as it wasn't acked, it went to the batphone [14:47:07] well, nope [14:47:15] I wasn't on call, btw [14:47:17] our oncall finished forty five minutes ago [14:47:31] maybe there is a gap [14:47:32] it happens [14:47:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P31721 and previous config saved to /var/cache/conftool/dbconfig/20220722-144734-ladsgroup.json [14:47:36] so it didn't go to business hours oncall at all. 
[14:47:43] between shifts [14:47:44] (03CR) 10Ahmon Dancy: [C: 03+1] scap: allow `scap` user to login into deployment-prep scap targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816140 (https://phabricator.wikimedia.org/T303559) (owner: 10Jaime Nuche) [14:47:50] there is an one-hour gap I can see in VO [14:47:50] akosiaris: in any case, presumably OK for me to resolve the VO incidents? [14:47:58] yeah, this week there's an hour in between [14:47:59] and it decided to page in that gap [14:48:21] Oh, yes, I got an email from nagios, then the VO email & SMS 5 mins later [14:48:28] actually, that might need fixing -- it still waited five minutes before going to the batphone [14:48:37] I raised this issue to managers and there is not much we can do about it, unless we make people work more hours [14:49:01] we don't have 24h coverage, so I agree [14:49:18] _joe_: nope, it appears there is a 1h gap right now in VO [14:49:50] would it have waited five minutes if we were outside of business hours entirely? [14:49:55] I guess that depends on the working hours of the European and US staff on any given week [14:50:13] But my understanding was that if there's a gap, it would simply page the batphone without having to wait to escalate [14:50:59] 10SRE, 10ops-eqiad: Degraded RAID on cloudweb1003 - https://phabricator.wikimedia.org/T313520 (10Vgutierrez) p:05Triage→03Medium a:03Andrew [14:51:09] <_joe_> marostegui: and that didn't happen? [14:51:17] _joe_: per Amir1...nop [14:51:29] so it didn't go to business hours oncall at all. [14:51:31] indeed [14:51:56] yup, went straight to the batphone, that's my understanding, is that correct ? 
[14:52:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P31722 and previous config saved to /var/cache/conftool/dbconfig/20220722-145201-ladsgroup.json [14:52:06] <_joe_> akosiaris: not sure the timing coincides [14:52:13] akosiaris: it was definitely 5 minutes before my phone buzzed with a push notif from VO [14:52:16] akosiaris: But the first p.age didn't do it, and it was the second one [14:52:19] <_joe_> mine too [14:52:24] same [14:52:55] there is no wait from the logs [14:53:00] maybe it means if there is no on call, it goes to blackhole? [14:53:10] it paged "everyone" almost immediately, although it took a while [14:53:12] <_joe_> for 5 minutes apparently [14:53:39] oh, no [14:53:49] it tried to contact cdanis first? [14:53:52] that's weird [14:54:06] <_joe_> incident chris FTW [14:54:07] jynus: I'm on the 'opt-in' email alerts rotation [14:54:07] my IRC logs + pa.ge coincide btw [14:54:12] 17:39 both [14:54:17] ah, so indeed there was a gap [14:54:23] just not for those with email [14:54:24] <_joe_> i got mine at 16:44 [14:54:32] <_joe_> from VO i mean [14:54:45] my push notification via the VO app was also :44 [14:54:53] yeah [14:55:01] my "page" was from email [14:55:09] sorry, I just p people again [14:55:51] not sure if that has a fix [14:55:58] I want to edit my survey response 😅 [14:56:03] * jbond also got push at 16:44 [14:56:38] yeah, I only saw it early because I have email notifications [14:57:17] <_joe_> that doesn't really count though, as far as VO functionality is concerned [14:57:21] ok, I was wrong [14:57:32] 2 things, is my patch ok, akosiaris? [14:57:35] went through the timeline, mine was at :44 too [14:57:36] <_joe_> there is a 5 minutes delay [14:57:44] should I file a ticket for the delay? 
[14:57:54] jynus: yeah, I am merging [14:57:55] > Note: If there is no on-call user scheduled in a rotation at the time when this escalation action is triggered, the resulting behavior is that no page will occur in this step. The time delay before the next step will remain as configured. For example, if an incident triggers an Escalation Policy during off-hours and there is no one on call in the rotation to immediately page, the escalation [14:57:58] policy will page no one and then wait however long is specified before executing step two. [14:57:59] https://help.victorops.com/knowledge-base/team-escalation-policy/ [14:58:29] if I had known it was no issue, I would have acked it early [14:58:45] but I thought at that point it was a production issue I didn't know to properly handle [14:59:21] I guess the question now is do we revert to paging everyone immediately? [14:59:30] (03CR) 10Alexandros Kosiaris: [C: 03+2] zookeeper: Disable notifications on conf1007,conf1008,conf1009 [puppet] - 10https://gerrit.wikimedia.org/r/816181 (https://phabricator.wikimedia.org/T311407) (owner: 10Jcrespo) [14:59:45] herron: we could work around this [14:59:48] herron: at least track the bug, I would say and decide on phab? 
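Editor's sketch: the VictorOps note quoted at 14:57:55 can be illustrated with a small timing model. It is not VictorOps code or the team's actual configuration. The step names, the 5 minute wait, and the `Step` and `page_times` helpers are all assumptions chosen to match the roughly 14:39 alert and :44 push notification reported in the channel.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    """One escalation step (hypothetical model, not a VictorOps object)."""
    name: str
    oncallers: List[str]   # users scheduled for this step right now
    wait_minutes: int      # configured delay before the next step runs


def page_times(steps: List[Step]) -> List[Tuple[int, str, List[str]]]:
    """Return (minutes since trigger, step name, paged users) per step that pages."""
    events = []
    elapsed = 0
    for step in steps:
        if step.oncallers:
            events.append((elapsed, step.name, list(step.oncallers)))
        # Per the quoted note, the delay still applies when nobody was paged.
        elapsed += step.wait_minutes
    return events


# Assumed policy: an empty business-hours rotation, then the batphone.
policy = [
    Step("business-hours", [], 5),
    Step("batphone", ["everyone"], 0),
]

for minute, name, users in page_times(policy):
    print(f"+{minute} min: {name} pages {users}")
```

Under these assumptions the first step pages nobody, yet its delay is still consumed, so the batphone step only fires five minutes after the incident triggers.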
[15:00:20] (03PS3) 10Jcrespo: zookeeper: Disable notifications on conf1007,conf1008,conf1009 [puppet] - 10https://gerrit.wikimedia.org/r/816181 (https://phabricator.wikimedia.org/T311407) [15:00:40] herron: is there a reason why, instead of using the current mechanisms, we couldn't query the API for if anyone is currently active in business hours oncall, and then change the routing key on the page [15:01:22] (03CR) 10Alexandros Kosiaris: [V: 03+2] zookeeper: Disable notifications on conf1007,conf1008,conf1009 [puppet] - 10https://gerrit.wikimedia.org/r/816181 (https://phabricator.wikimedia.org/T311407) (owner: 10Jcrespo) [15:01:39] herron: the code for checking current oncalls exists already btw :) [15:01:41] !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2046.codfw.wmnet with reason: host reimage [15:02:15] Alex sorry for the dirty initial patch, under stress I like to get things done even if not properly :-) (I like to work on head) [15:02:34] jynus: no worries [15:02:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T312863)', diff saved to https://phabricator.wikimedia.org/P31723 and previous config saved to /var/cache/conftool/dbconfig/20220722-150239-ladsgroup.json [15:02:41] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance [15:02:42] ladsgroup@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [15:02:43] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [15:02:50] hmm the escalations take multiple rosters into account, not sure? 
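Editor's sketch: the workaround suggested at 15:00:40 (check who is currently on call for business hours, then pick the routing key for the page) could look roughly like the following. The key names and the `choose_routing_key` helper are hypothetical, and the list of on-callers is passed in by the caller, so no real VictorOps API call is assumed here.

```python
from typing import Sequence


def choose_routing_key(
    business_hours_oncallers: Sequence[str],
    business_key: str = "business-hours",  # hypothetical routing key name
    batphone_key: str = "batphone",        # hypothetical routing key name
) -> str:
    """Route to business-hours on-call only when someone is actually scheduled."""
    if business_hours_oncallers:
        return business_key
    # Nobody is scheduled: skip the empty step and its delay entirely.
    return batphone_key


# In the gap between shifts the on-call list is empty.
print(choose_routing_key([]))
print(choose_routing_key(["alice"]))
```

Deciding the key before the page is sent avoids relying on the escalation policy to skip an empty rotation, which the quoted documentation says it will not do without waiting.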
[15:02:55] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1102.eqiad.wmnet with reason: Maintenance [15:03:01] should have fixed those conf* hosts some time ago [15:03:10] will create the revert and comment on ticket so we don't forget [15:03:45] !log ruby-rbtree to puppet7 component [15:03:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:49] no issue, just thanks to alex we got the ability to set that up on puppet- dbas use that all the time to avoid false Ps [15:04:06] ottomata: I 'll need some help putting new zookeeper hosts into production and wikitech is short on documentation. Can we sync up on say Monday? [15:05:11] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2046.codfw.wmnet with reason: host reimage [15:05:39] (03PS1) 10Jgreen: Update nsca_frack.cfg.erb replacing host frdb1002 with frdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/816182 (https://phabricator.wikimedia.org/T312583) [15:06:09] (03PS1) 10Jcrespo: zookeeper: Reenable notifications for conf1007,conf1008,conf1009 [puppet] - 10https://gerrit.wikimedia.org/r/816149 [15:06:36] (03CR) 10Jcrespo: [C: 04-2] "Do not merge until Alex/Otto finish their setup work." 
[puppet] - 10https://gerrit.wikimedia.org/r/816149 (owner: 10Jcrespo) [15:07:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T312863)', diff saved to https://phabricator.wikimedia.org/P31724 and previous config saved to /var/cache/conftool/dbconfig/20220722-150707-ladsgroup.json [15:07:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:07:10] (03PS2) 10Jcrespo: zookeeper: Reenable notifications for conf1007,conf1008,conf1009 [puppet] - 10https://gerrit.wikimedia.org/r/816149 (https://phabricator.wikimedia.org/T311407) [15:07:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [15:07:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T312863)', diff saved to https://phabricator.wikimedia.org/P31725 and previous config saved to /var/cache/conftool/dbconfig/20220722-150727-ladsgroup.json [15:07:44] (03CR) 10Jaime Nuche: scap: allow `scap` user to login into deployment-prep scap targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816140 (https://phabricator.wikimedia.org/T303559) (owner: 10Jaime Nuche) [15:08:21] (03CR) 10Ahmon Dancy: [C: 03+1] scap: allow `scap` user to login into deployment-prep scap targets (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/816140 (https://phabricator.wikimedia.org/T303559) (owner: 10Jaime Nuche) [15:08:56] (03CR) 10Jcrespo: zookeeper: Reenable notifications for conf1007,conf1008,conf1009 [puppet] - 10https://gerrit.wikimedia.org/r/816149 (https://phabricator.wikimedia.org/T311407) (owner: 10Jcrespo) [15:09:47] (03CR) 10Jgreen: [C: 03+2] Update nsca_frack.cfg.erb replacing host frdb1002 with frdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/816182 (https://phabricator.wikimedia.org/T312583) (owner: 10Jgreen) [15:16:25] 10SRE-OnFire, 10observability: Business 
hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10CDanis) [15:17:08] 10SRE-OnFire, 10observability: Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10CDanis) [15:19:43] 10SRE-OnFire, 10observability: Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (10CDanis) BTW, if it would be helpful for me to add a CLI utility to fetch the current business hours oncallers, I'm very happy to d... [15:20:52] (03PS1) 10Ladsgroup: mariadb: Fix variable name in grant file [puppet] - 10https://gerrit.wikimedia.org/r/816186 [15:21:54] !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2046.codfw.wmnet with OS bullseye [15:22:01] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host elastic2046.codfw.wmnet with OS bullseye completed: - elastic2033 (**PAS... 
[15:25:01] (03PS1) 10Jbond: C:package_builder: Allow users to add a specific component to the build [puppet] - 10https://gerrit.wikimedia.org/r/816187 (https://phabricator.wikimedia.org/T313387) [15:27:26] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/pcc-worker1001/36357/build2001.codfw.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/816187 (https://phabricator.wikimedia.org/T313387) (owner: 10Jbond) [15:30:56] (03CR) 10Ladsgroup: "PCC agrees https://puppet-compiler.wmflabs.org/pcc-worker1003/36358/db1173.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/816186 (owner: 10Ladsgroup) [15:31:01] (03CR) 10Ladsgroup: [C: 03+2] mariadb: Fix variable name in grant file [puppet] - 10https://gerrit.wikimedia.org/r/816186 (owner: 10Ladsgroup) [15:36:51] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [15:37:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [15:37:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [15:38:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [15:49:39] !log ruby-sorted-set to puppet7 component [15:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:27] !log ruby-semantic-puppet to puppet7 component [15:57:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:53] !log puppet-agent to puppet7 component [16:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:15] PROBLEM - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - Found 2 eligible masters. 
https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [16:14:33] looking [16:18:33] PROBLEM - Check systemd state on elastic2031 is CRITICAL: CRITICAL - degraded: The following units failed: elasticsearch-disable-readahead.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:19:07] Looks like 2031 (the missing master on elasticsearch) is getting reimaged but the cookbook's blocked on user input before the actual reimage; gonna abort [16:19:28] !log bking@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster reimage (bullseye upgrade) - bking@cumin1001 - T289135 [16:19:34] T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 [16:21:28] Ran puppet on 2031, its elasticsearch services should be back up soon [16:23:33] RECOVERY - Check systemd state on elastic2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:23:47] RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert [16:30:41] PROBLEM - NFS on clouddumps1001 is CRITICAL: connect to address 208.80.154.142 and port 2049: Connection refused https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore [16:33:17] ACKNOWLEDGEMENT - HTTP on clouddumps1001 is CRITICAL: connect to address 208.80.154.142 and port 80: Connection refused Andrew Bogott these boxes are still a work in progress https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29 [16:33:18] ACKNOWLEDGEMENT - NFS on clouddumps1001 is CRITICAL: connect to address 208.80.154.142 and port 2049: Connection refused Andrew Bogott these boxes are still a 
work in progress https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore [16:33:22] ACKNOWLEDGEMENT - HTTP on clouddumps1002 is CRITICAL: connect to address 208.80.154.71 and port 80: Connection refused Andrew Bogott these boxes are still a work in progress https://wikitech.wikimedia.org/wiki/Dumps/XML-SQL_Dumps%23A_labstore_host_dies_%28web_or_nfs_server_for_dumps%29 [16:33:23] ACKNOWLEDGEMENT - NFS on clouddumps1002 is CRITICAL: connect to address 208.80.154.71 and port 2049: Connection refused Andrew Bogott these boxes are still a work in progress https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore [16:45:56] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: no-op deploy to sync up new cloudweb hosts [16:47:56] (03PS3) 10Bernard Wang: Remove Table of Contents config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/810405 (https://phabricator.wikimedia.org/T310527) (owner: 10Clare Ming) [16:48:23] 10SRE, 10ops-eqiad: Degraded RAID on cloudweb1003 - https://phabricator.wikimedia.org/T313520 (10Andrew) 05Open→03Resolved I no longer see this error, I think it must've been transitional when the host was originally puppetized. [16:50:10] 10SRE, 10Campaign-Tools, 10Foundational Technology Requests: [Request for Comment] Campaigns Geolocation API proposal - https://phabricator.wikimedia.org/T312677 (10jcrespo) I'm adding @vgutierrez to the ticket as he is on clinic duty this week and he could route to the right person. 
[16:51:52] 10ops-eqiad, 10decommission-hardware: decommission frdb1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T313607 (10Jgreen) [16:54:43] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: no-op deploy to sync up new cloudweb hosts (duration: 08m 47s) [17:20:51] (03CR) 10BCornwall: geodns: Move eqsin, drmrs and esams around in Asia (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/816053 (owner: 10BCornwall) [17:31:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [17:32:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [17:32:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T312863)', diff saved to https://phabricator.wikimedia.org/P31727 and previous config saved to /var/cache/conftool/dbconfig/20220722-173218-ladsgroup.json [17:32:22] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [17:32:53] (03PS1) 10Jbond: P:puppet::agent: Add parameter to install puppet-agent 7 [puppet] - 10https://gerrit.wikimedia.org/r/816205 [17:33:41] (03PS2) 10BCornwall: geodns: Move eqsin, drmrs and esams around in Asia [dns] - 10https://gerrit.wikimedia.org/r/816053 (https://phabricator.wikimedia.org/T311472) [17:33:48] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36359/console" [puppet] - 10https://gerrit.wikimedia.org/r/816205 (owner: 10Jbond) [17:46:29] (03CR) 10BCornwall: geodns: Map out African countries by DC latency (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/816028 (https://phabricator.wikimedia.org/T311472) (owner: 10BCornwall) [17:52:31] 10SRE-OnFire, 10observability: Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - 
https://phabricator.wikimedia.org/T313603 (herron) As a very near term stopgap to avoid delays over the weekend I'll be updating the VO config to page batphone immediately t... [17:54:01] ahhh so that's why those pages were sometimes slow [17:54:04] (CR) Dzahn: [C: +2] "yea, I totally agree, good to have a task and can be done later, thanks!" [puppet] - https://gerrit.wikimedia.org/r/815400 (https://phabricator.wikimedia.org/T313250) (owner: Dzahn) [17:54:09] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [17:55:11] SRE-OnFire, observability, SRE Observability (FY2022/2023-Q1): Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (lmata) [18:34:42] SRE, LDAP-Access-Requests: Grant Access to LDAP wmf group for Aline Bruenger WMDE - https://phabricator.wikimedia.org/T312220 (Dzahn) Hi @Aline_Bruenger_WMDE we still need your help to get this moving. Please register a user on the wikitech wiki at https://wikitech.wikimedia.org/ and let us know when tha... [18:36:51] SRE, LDAP-Access-Requests: Grant Access to LDAP wmf group for Aline Bruenger WMDE - https://phabricator.wikimedia.org/T312220 (Dzahn) Nevermind, I found the user in LDAP and know the WMDE email format. No worries, patch is incoming. [18:42:23] (PS1) Dzahn: admin: add Aline Bruenger to ldap_only admins (wmde,nda) [puppet] - https://gerrit.wikimedia.org/r/816213 (https://phabricator.wikimedia.org/T312220) [18:43:33] (CR) Dzahn: "when this is merged it needs the "mwmaint1002:~] $ sudo modify-ldap-group wmde" and "mwmaint1002:~] $ sudo modify-ldap-group nda"." 
[puppet] - https://gerrit.wikimedia.org/r/816213 (https://phabricator.wikimedia.org/T312220) (owner: Dzahn) [18:45:41] SRE-OnFire, observability, SRE Observability (FY2022/2023-Q1): Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (herron) We have also raised an issue with VO about this to at a minimum document the probl... [18:46:47] (PS2) Dzahn: admin: add Aline Bruenger to ldap_only admins (wmde,nda) [puppet] - https://gerrit.wikimedia.org/r/816213 (https://phabricator.wikimedia.org/T312220) [18:54:34] (PS10) Dzahn: DO-NOT-SUBMIT(Under review and discussion): [puppet] - https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: Mary Yang) [18:58:07] SRE-OnFire, observability, SRE Observability (FY2022/2023-Q1): Business hours oncall implementation delays pages to batphone by 5 minutes when there are no oncallers - https://phabricator.wikimedia.org/T313603 (CDanis) >>! In T313603#8098056, @herron wrote: > We have also raised an issue with VO abou... 
[18:58:46] (CR) Dzahn: "@Mary I was bold and uploaded PS10, that made the CI bot happy" [puppet] - https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: Mary Yang) [18:59:59] (CR) Dzahn: "here is just the diff between PS9 and PS10 btw: https://gerrit.wikimedia.org/r/c/operations/puppet/+/810146/9..10" [puppet] - https://gerrit.wikimedia.org/r/810146 (https://phabricator.wikimedia.org/T311457) (owner: Mary Yang) [19:30:50] SRE, SRE-Access-Requests: Requesting access to RESOURCE for SGrabarczuk - https://phabricator.wikimedia.org/T313616 (sgrabarczuk) [19:43:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance [19:44:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1112.eqiad.wmnet with reason: Maintenance [19:44:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:44:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [19:44:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T312863)', diff saved to https://phabricator.wikimedia.org/P31729 and previous config saved to /var/cache/conftool/dbconfig/20220722-194428-ladsgroup.json [19:44:32] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [19:51:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T312863)', diff saved to https://phabricator.wikimedia.org/P31730 and previous config saved to /var/cache/conftool/dbconfig/20220722-195153-ladsgroup.json [19:51:57] T312863: Schema change to change primary key of templatelinks - 
https://phabricator.wikimedia.org/T312863 [19:58:59] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 131, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:59:39] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:01:23] (PS1) Dzahn: deployment_server: add gerrit host key for mwpresync pushing to gerrit [puppet] - https://gerrit.wikimedia.org/r/816221 [20:04:02] (PS2) Dzahn: deployment_server: add gerrit host key for mwpresync pushing to gerrit [puppet] - https://gerrit.wikimedia.org/r/816221 (https://phabricator.wikimedia.org/T303857) [20:06:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P31731 and previous config saved to /var/cache/conftool/dbconfig/20220722-200658-ladsgroup.json [20:22:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P31732 and previous config saved to /var/cache/conftool/dbconfig/20220722-202203-ladsgroup.json [20:27:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T312863)', diff saved to https://phabricator.wikimedia.org/P31733 and previous config saved to /var/cache/conftool/dbconfig/20220722-202743-ladsgroup.json [20:27:48] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [20:37:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T312863)', diff saved to https://phabricator.wikimedia.org/P31734 and previous config saved to /var/cache/conftool/dbconfig/20220722-203708-ladsgroup.json [20:37:13] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [20:37:13] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [20:37:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [20:37:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [20:37:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 8 hosts with reason: Maintenance [20:42:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P31735 and previous config saved to /var/cache/conftool/dbconfig/20220722-204248-ladsgroup.json [20:44:44] !log brennen@deploy1002 Started deploy [phabricator/deployment@f962d0e]: (no justification provided) [20:44:51] !log brennen@deploy1002 Finished deploy [phabricator/deployment@f962d0e]: (no justification provided) (duration: 00m 07s) [20:57:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P31736 and previous config saved to /var/cache/conftool/dbconfig/20220722-205754-ladsgroup.json [21:04:49] !log brennen@deploy1002 Started deploy [phabricator/deployment@f962d0e]: (no justification provided) [21:05:18] !log brennen@deploy1002 Finished deploy [phabricator/deployment@f962d0e]: (no justification provided) (duration: 00m 29s) [21:08:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T312863)', diff saved to https://phabricator.wikimedia.org/P31737 and previous config saved to /var/cache/conftool/dbconfig/20220722-210813-ladsgroup.json [21:08:18] T312863: Schema change to change primary key of templatelinks - 
https://phabricator.wikimedia.org/T312863 [21:12:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T312863)', diff saved to https://phabricator.wikimedia.org/P31738 and previous config saved to /var/cache/conftool/dbconfig/20220722-211259-ladsgroup.json [21:13:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [21:13:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [21:13:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T312863)', diff saved to https://phabricator.wikimedia.org/P31739 and previous config saved to /var/cache/conftool/dbconfig/20220722-211308-ladsgroup.json [21:13:21] (PS1) Brennen Bearnes: scap: set keyholder_key: phabricator [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/816223 (https://phabricator.wikimedia.org/T313259) [21:13:34] (CR) BryanDavis: [C: +1] wikimedia.org: Add developers.wikimedia.org alias [dns] - https://gerrit.wikimedia.org/r/816174 (https://phabricator.wikimedia.org/T313597) (owner: Zabe) [21:13:52] (CR) BryanDavis: [C: +1] mediawiki: Redirect developers.wm.o to developer.wm.o [puppet] - https://gerrit.wikimedia.org/r/816175 (https://phabricator.wikimedia.org/T313597) (owner: Zabe) [21:14:49] (CR) Brennen Bearnes: [V: +2 C: +2] "Tested from deploy1002, confirmed working." 
[phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/816223 (https://phabricator.wikimedia.org/T313259) (owner: Brennen Bearnes) [21:17:12] (CR) RhinosF1: [C: +1] wikimedia.org: Add developers.wikimedia.org alias [dns] - https://gerrit.wikimedia.org/r/816174 (https://phabricator.wikimedia.org/T313597) (owner: Zabe) [21:18:40] (CR) RhinosF1: [C: +1] mediawiki: Redirect developers.wm.o to developer.wm.o [puppet] - https://gerrit.wikimedia.org/r/816175 (https://phabricator.wikimedia.org/T313597) (owner: Zabe) [21:20:16] (CR) Dzahn: [C: +1] scap: set keyholder_key: phabricator [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/816223 (https://phabricator.wikimedia.org/T313259) (owner: Brennen Bearnes) [21:21:12] (CR) Dzahn: [C: +1] "[deploy1002:~] $ SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh -oIdentitiesOnly=yes -oIdentityFile=/etc/keyholder.d/phabricator phab-deploy@" [phabricator/deployment] (wmf/stable) - https://gerrit.wikimedia.org/r/816223 (https://phabricator.wikimedia.org/T313259) (owner: Brennen Bearnes) [21:23:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P31740 and previous config saved to /var/cache/conftool/dbconfig/20220722-212319-ladsgroup.json [21:29:47] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:38:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P31741 and previous config saved to /var/cache/conftool/dbconfig/20220722-213824-ladsgroup.json [21:53:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T312863)', diff saved to https://phabricator.wikimedia.org/P31742 and previous config saved to 
/var/cache/conftool/dbconfig/20220722-215329-ladsgroup.json [21:53:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [21:53:37] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [21:53:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1113.eqiad.wmnet with reason: Maintenance [21:53:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T312863)', diff saved to https://phabricator.wikimedia.org/P31743 and previous config saved to /var/cache/conftool/dbconfig/20220722-215349-ladsgroup.json [22:31:11] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:08:19] SRE, Infrastructure-Foundations, Patch-For-Review: Assign SPDX headers to puppet.git - https://phabricator.wikimedia.org/T308013 (Dylsss) [23:19:06] (CR) Cwhite: "LGTM" [puppet] - https://gerrit.wikimedia.org/r/816135 (https://phabricator.wikimedia.org/T312947) (owner: Filippo Giunchedi) [23:19:33] (CR) Cwhite: [C: +1] "LGTM" [alerts] - https://gerrit.wikimedia.org/r/816136 (https://phabricator.wikimedia.org/T312947) (owner: Filippo Giunchedi) [23:22:29] (CR) Cwhite: prometheus: update blackbox check alerts runbook link (1 comment) [puppet] - https://gerrit.wikimedia.org/r/816135 (https://phabricator.wikimedia.org/T312947) (owner: Filippo Giunchedi) [23:26:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T312863)', diff saved to https://phabricator.wikimedia.org/P31744 and previous config saved to /var/cache/conftool/dbconfig/20220722-232609-ladsgroup.json [23:26:16] T312863: Schema change to change primary key of templatelinks - https://phabricator.wikimedia.org/T312863 [23:41:15] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P31745 and previous config saved to /var/cache/conftool/dbconfig/20220722-234114-ladsgroup.json [23:56:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P31746 and previous config saved to /var/cache/conftool/dbconfig/20220722-235619-ladsgroup.json