[01:43:10] DBA, DC-Ops, Operations, ops-codfw: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Papaul) ` papaul@asw-a-codfw# show | compare [edit interfaces interface-range vlan-private1-a-codfw] member ge-6/0/13 { ... } + member ge-1/0/7...
[01:43:27] DBA, DC-Ops, Operations, ops-codfw: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Papaul)
[01:52:14] DBA, DC-Ops, Operations, ops-codfw, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Papaul)
[02:20:54] DBA, MediaWiki-User-management, MW-1.35-notes (1.35.0-wmf.30; 2020-04-28), Platform Team Workboards (Clinic Duty Team), and 2 others: Rename ipb_address index on ipb_address to ipb_address_unique - https://phabricator.wikimedia.org/T250071 (Ladsgroup) I hate reopening tickets so I won't but my ch...
[05:09:57] DBA, Cloud-Services, MW-1.35-notes (1.35.0-wmf.36; 2020-06-09), Platform Team Initiatives (MCR Schema Migration), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (Marostegui)
[05:17:38] DBA, MediaWiki-User-management, MW-1.35-notes (1.35.0-wmf.30; 2020-04-28), Platform Team Workboards (Clinic Duty Team), and 2 others: Rename ipb_address index on ipb_address to ipb_address_unique - https://phabricator.wikimedia.org/T250071 (Marostegui) This is useful Amir! Don't worry! You are he...
[05:17:47] DBA, MediaWiki-User-management, MW-1.35-notes (1.35.0-wmf.30; 2020-04-28), Platform Team Workboards (Clinic Duty Team), and 2 others: Rename ipb_address index on ipb_address to ipb_address_unique - https://phabricator.wikimedia.org/T250071 (Marostegui) Resolved→Open
[05:17:49] DBA, Datasets-General-or-Unknown, Patch-For-Review, Sustainability (Incident Followup), WorkType-NewFunctionality: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (Marostegui)
[05:18:35] DBA, CheckUser: Monitor the growth of CheckUser tables thanks to the addition of login data - https://phabricator.wikimedia.org/T261999 (Marostegui) p:Triage→Medium
[05:22:50] DBA, CheckUser: Monitor the growth of CheckUser tables thanks to the addition of login data - https://phabricator.wikimedia.org/T261999 (Marostegui)
[05:24:48] DBA, CheckUser: Monitor the growth of CheckUser tables thanks to the addition of login data - https://phabricator.wikimedia.org/T261999 (Marostegui) First figures added
[05:38:22] DBA, MediaWiki-User-management, MW-1.35-notes (1.35.0-wmf.30; 2020-04-28), Platform Team Workboards (Clinic Duty Team), and 2 others: Rename ipb_address index on ipb_address to ipb_address_unique - https://phabricator.wikimedia.org/T250071 (Marostegui) The following wikis are also affected on tho...
[05:56:35] DBA, Datasets-General-or-Unknown, Patch-For-Review, Sustainability (Incident Followup), WorkType-NewFunctionality: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves - https://phabricator.wikimedia.org/T104459 (Marostegui)
[05:56:58] DBA, MediaWiki-User-management, MW-1.35-notes (1.35.0-wmf.30; 2020-04-28), Platform Team Workboards (Clinic Duty Team), and 2 others: Rename ipb_address index on ipb_address to ipb_address_unique - https://phabricator.wikimedia.org/T250071 (Marostegui) Open→Resolved I have fixed those hos...
[09:22:23] DBA, Continuous-Integration-Config, Patch-For-Review, Release-Engineering-Team (CI & Testing services), Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)): Run wmfmariadbpy integration test suite on CI - https://phabricator.wikimedia.org/T261098 (hashar)
[09:24:42] DBA, Continuous-Integration-Config, Patch-For-Review, Release-Engineering-Team (CI & Testing services), and 2 others: Run wmfmariadbpy integration test suite on CI - https://phabricator.wikimedia.org/T261098 (Kormat)
[10:14:48] make sure someone is in the cloud channel, as sometimes mariadb module changes break cloud instances outside of production
[10:16:01] DBA, Patch-For-Review, User-Kormat, cloud-services-team (Kanban): Upgrade m5 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T260324 (Marostegui)
[10:16:09] DBA, Epic, Patch-For-Review: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (Marostegui)
[10:16:11] DBA, Patch-For-Review, User-Kormat, cloud-services-team (Kanban): Upgrade m5 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T260324 (Marostegui) Open→Resolved
[10:16:36] marostegui: congrats!
:)
[10:17:02] DBA, Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (Marostegui) @jcrespo db1133 is ready for you. It was originally thought to be placed on backup testing, but feel free to move it around the backup infra wherever you prefer. It has notifications disabled.
[10:17:28] DBA, Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (Marostegui)
[10:18:07] DBA, Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (Marostegui)
[10:18:27] \o/
[10:20:27] DBA, Patch-For-Review: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (jcrespo) Thanks, @Marostegui. It really would be helpful to test backup recoveries on eqiad, too, without affecting database testing. I believe db1077 will "return" to you (mw db testing) soon: T187984#6433997
[10:20:58] how was the switchover yesterday (I had a meeting with our manager at the same time)
[10:21:12] ?
[10:21:15] it went ok
[10:22:47] I have spent all of today writing the media backups document, I think I am more than halfway
[10:28:34] jynus: fyi, I am scheduling the OTRS upgrade on Monday Sept 14th. If everything goes well, we should be done by Wednesday and then cleanup the various stuff we had to setup for this (e.g. the otrs snapshot on db1077).
[10:29:12] akosiaris: do we need an m2 snapshot early on that monday?
[10:29:20] jynus: I was about to ask for that :-)
[10:29:35] maybe even one loaded into db1077 for quick failover?
[10:29:55] jynus: whatever suits you better
[10:30:29] recovery is at network speed, but I think it is non-negligible for OTRS database
[10:30:57] we are anyway scheduling a 48h window
[10:31:05] or actually, maybe just stopping replication
[10:31:15] so we aren't gonna be really pressured time wise
[10:32:15] I am thinking
[10:33:00] do you want db1077 to be available during the upgrade process?
[10:33:20] nope, not needed
[10:34:07] let me paste a proposal of things to do db-wise before upgrade
[10:41:08] akosiaris: https://phabricator.wikimedia.org/T187984#6435616
[10:42:12] in other words, we snapshot it, but keep a hot spare before maintenance ready to failover
[10:42:24] that is, of course, on top of the regular backups
[11:25:37] DBA, CheckUser: Monitor the growth of CheckUser tables thanks to the addition of login data - https://phabricator.wikimedia.org/T261999 (Huji)
[12:09:51] es2017 WARNING: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[pt-heartbeat-kill] ?
[12:10:08] jynus: thanks, i'll take a look.
[12:10:08] why is pt-heartbeat-kill called on es2017?
[12:12:58] wait pt-heartbeat-kill ?
[12:13:09] shouldn't it be pt-heartbeat or wmf-pt-kill?
[12:13:24] jynus: it failed trying to kill pt-heartbeat, which wasn't running
[12:13:29] ah
[12:13:36] so it is the name of the exec
[12:13:41] not a daemon
[12:13:53] ok, I thought there was a pt-kill process there
[12:13:54] -rw-r--r-- 1 root root 4 Jan 11 2019 /var/run/pt-heartbeat.pid
[12:14:12] deleted that file, should prevent that from recurring.
[12:14:19] jynus: yeah ack :)
[12:14:19] and I was getting worried pt-kill does only run on labsdb
[12:14:35] I confused the 2 daemons
[12:14:47] no issue
[12:21:23] jynus: nice catch, and thanks for the heads-up
[12:22:48] it was a non-issue, really
[12:23:07] I just thought it was something worse initially and that is why I pinged here
[12:23:43] e.g. es2* host getting labsdb profiles or something
[12:24:03] thanks for the quick fix
[12:24:12] yeah i was also immediately concerned that one of my recent puppet CRs had broken things :)
[12:24:38] which made no sense really, all your recent CRs were noop
[12:24:45] or mostly noop, just refactoring
[12:25:43] for context, we have an actual systemd process that handles pt-kill on labsdb
[12:25:51] ah, cool.
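Editorial aside on the stale PID file found at 12:13:54: the fix in the log was simply to delete /var/run/pt-heartbeat.pid by hand. A minimal Python sketch of the general check ("does this PID file still point at a live process?") might look like the following. The function names `pid_is_running` and `remove_if_stale` are hypothetical, nothing here is taken from the WMF Puppet code, and it assumes a POSIX host. Note that a recycled PID can make an old file look live, so this is only a heuristic.

```python
import os
from pathlib import Path


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without signalling
    except (ProcessLookupError, OverflowError):
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def remove_if_stale(pid_file: Path) -> bool:
    """Delete pid_file if it does not point at a live process.

    Returns True if the file was removed, False otherwise.
    """
    try:
        pid = int(pid_file.read_text().strip())
    except FileNotFoundError:
        return False
    except ValueError:
        pid = -1  # unparsable contents are treated as stale
    if pid_is_running(pid):
        return False
    pid_file.unlink()
    return True
```

Under these assumptions, `remove_if_stale(Path("/var/run/pt-heartbeat.pid"))` run as root would have removed the 2019 file only if no process with the recorded PID was alive.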
[12:25:58] we would also want to have systemd handling pt-heartbeat
[12:26:04] but never got the time
[12:26:08] yeah, that's on my vague todo list
[12:26:20] bah, lots of work to do and not enough time!
[12:26:34] in fact
[12:26:42] c'est la guerre
[12:26:45] we consider also at some point
[12:26:55] moving pt-heartbeat to the application
[12:27:07] to guarantee there was only 1 process per section
[12:27:28] but the "multiple processes at the same time" was just easier on dc failover
[12:27:41] just makes master switchovers more complex
[12:28:01] so at least being on systemd will be cleaner
[12:35:13] marostegui, sobanski: hi!
[12:35:23] o/
[12:35:26] so, i have a few things on my plate, and i was wondering about order
[12:35:50] rebooting hosts: T261389, rebooting other hosts: T223430, and schema change: T259831
[12:35:51] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831
[12:36:01] stashbot: thanks, you tried.
[12:36:01] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[12:36:39] kormat: and you might have even more next week :-)
[12:36:46] * kormat whimpers
[12:38:05] kormat: So I think for T223430 just do the eqiad ones, codfw can be done once we are back in eqiad, for T261389 same, focus on the primary masters first (but coordinate with me as I am deploying long running schema changes there), I normally do just 2-3 per day, on a daily basis, so it is not that annoying
[12:38:07] Interestingly, I don't seem to have access to those
[12:38:10] The schema change, we can start that next week
[12:38:17] sobanski: I will add you, those are security related
[12:38:25] Thanks
[12:38:43] sobanski: done
[12:39:31] kormat: So I think for the reboots, just do that as a background thing, just a few per day, focusing on primary masters and pcXXXX first I would say.
I wouldn't do es1, es2 and es3, as those are going to be replaced very soon (HW should arrive soon to the DC)
[12:40:02] ok cool
[12:40:24] kormat: we can start the schema change next week, and we need to keep in mind that we'll have to check the PDU maintenance (I created event calendars for each day for the racks we've got hosts in) aaaaaand we can do this together too at some point: https://phabricator.wikimedia.org/T239238
[12:40:44] And: one of these https://phabricator.wikimedia.org/T186188#6418919
[12:40:55] marostegui: it sounds like i'm going to be spending a Lot of quality time with you. well crap.
[12:41:17] kormat: don't you like codfw being primary? eqiad is quieter and we spend time together. it is my fav time of the year
[12:41:22] haha
[12:41:28] Xmas gift but in sept!
[12:42:24] I have checked the hosts for the row D upgrade and we don't have any active service there (ie: misc masters)
[12:42:33] So do we avoid making changes on the affected hosts during the PDU maintenance?
[12:43:02] sobanski: There is no impact expected really, but for some hosts (eqiad masters) I would like to stop mysql, just in case the power does fail
[12:43:08] So we can avoid possible corruptions
[12:43:15] Got it
[12:43:45] There are some hosts affected that do have service, like labsdb hosts and some proxies, as we don't failover those to codfw (that service doesn't exist in codfw)
[12:43:56] So for those, I have asked Willy to be extra careful
[12:44:07] Is there a way to coordinate that in real time or will you just stop them well in advance?
[12:44:41] sobanski: Normally we stop them in advance, as sometimes mysql process can take long to stop, and then we join the DC ops channel for real time coordination
[12:45:32] That is #wikimedia-dcops
[12:45:48] I normally join when there is maintenance, otherwise...too many channels already :)
[12:46:16] kormat: does that help your planning or did I confuse you even more?
:)
[12:46:23] marostegui: 'yes'
[12:46:28] haha
[12:47:06] kormat: TL;DR, feel free to start rebooting a few servers per day and we can chat about the schema change and the s8 eqiad master failover next week
[12:47:32] it's a good thing i updated the cumin aliases recently
[12:47:35] I am actively working on s8 and s4 primary masters, and I expect them to keep running stuff till Monday/Tuesday
[12:47:50] `cumin A:db-all reboot` will greatly increase the progress here
[12:48:01] kormat: And that's good, I thought you'd drop databases
[12:48:08] :D
[12:59:40] marostegui: huh. did we not failover pcX masters to codfw?
[12:59:52] yes, we did, why?
[13:00:00] tendril doesn't reflect that
[13:00:04] ah yes
[13:00:06] you mean in tendril
[13:00:12] Yeah, I made a note to rzl about it
[13:00:23] Let me fix that manually
[13:00:25] to avoid confusions
[13:00:29] thanks :)
[13:00:46] It is on the DC switch document, as a follow up, I will get that sorted now
[13:03:14] kormat: done! thanks for the reminder :)
[13:03:19] yw ;)
[14:39:42] kormat: apologies if my comments sometimes stress you: https://gerrit.wikimedia.org/r/c/operations/puppet/+/620899
[14:40:02] this was not intended to hurry you up as much as leaving a trail to keep it tracked
[14:40:11] sorry if it had the unintended effect
[14:40:59] consider all my gerrit comments (even IRC) as async communication- not needing immediate answer
[14:41:57] jynus: nono, there was no problem
[18:52:36] DBA, Operations: dbmonitor2001 is lacking mysql_connect(), usage of do_acme for https monitoring - https://phabricator.wikimedia.org/T262085 (Dzahn)
[18:56:10] DBA, Operations: dbmonitor2001 is lacking mysql_connect(), usage of do_acme for https monitoring - https://phabricator.wikimedia.org/T262085 (jcrespo) Dzahn, this ticket is 100% accurate, but you may not be aware of the why of this- which is explained on T224589. I would suggest to add your comments t...
[19:01:30] DBA, Operations: dbmonitor2001 is lacking mysql_connect(), usage of do_acme for https monitoring - https://phabricator.wikimedia.org/T262085 (jcrespo) tl;dr: If we want to make tendril work, we need to revert dbmonitor2001 back to jessie to have the php-mysql extension, which would be a huge security con...
[19:03:11] DBA, Operations: dbmonitor2001 is lacking mysql_connect(), usage of do_acme for https monitoring - https://phabricator.wikimedia.org/T262085 (Dzahn)
[19:04:32] DBA, Operations: dbmonitor2001 is lacking mysql_connect(), usage of do_acme for https monitoring - https://phabricator.wikimedia.org/T262085 (Dzahn) Was about to paste the relevant part and ask more questions about this when I saw your comment. Ack, merged it in as a duplicate. ` 5 class role::tend...
[19:05:20] DBA, Operations: dbmonitor2001 is lacking mysql_connect(), usage of do_acme for https monitoring - https://phabricator.wikimedia.org/T262085 (Dzahn) >>! In T262085#6436919, @jcrespo wrote: > tl;dr: If we want to make tendril work, we need to revert dbmonitor2001 back to jessie to have the php-mysql exten...