[01:52:09] DBA, Patch-For-Review: Upgrade m1 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254556 (Johan) Since next Tech News won't go out until Monday anyway (and I don't think we need to be too concerned about a few seconds of Etherpad read-only) – do re-instate if the update for some reason go...
[04:44:20] DBA, Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (Marostegui) Thanks @Dzahn There is still a question from T254795#6204661 that needs answering: - Which grants do we need this user to have? Also, to confirm, connections will come from: mwdebug1001, mwdebug...
[04:47:24] DBA: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (Marostegui)
[04:57:59] DBA, Operations: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Marostegui) Anything else left here after the 100% repool or we can close this? Thank you!
[05:02:42] DBA: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (Marostegui)
[05:05:21] DBA: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (Marostegui)
[06:24:09] DBA, Performance-Team: Database for XHGui profiles - https://phabricator.wikimedia.org/T254795 (Dzahn) @dpifke Doesn't it have to connect from xhgui1001/xhgui2001 (but would that be in addition to mwdebug and webperf* ?)
[07:39:53] DBA, Operations: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Kormat) Open→Resolved Nope, all done!
[07:51:00] Jun 25 07:50:29 backup1001 systemd[1]: Stopped Bacula Director Daemon service.
[07:51:12] \o/
[07:51:37] remind me the ticket, sorry, if you have it handy?
[07:51:49] https://phabricator.wikimedia.org/T254556
[08:09:24] DBA, Operations, ops-codfw: Degraded RAID on pc2007 - https://phabricator.wikimedia.org/T255904 (Kormat) Open→Resolved Array rebuild has completed, and is back in "optimal" state.
[08:40:14] DBA, Operations, CAS-SSO, Patch-For-Review, User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (jbond) >>! In T256120#6252975, @Marostegui wrote: > Should be fixed now. Thanks although I'm now getting "Error message: CREATE command denied...
[08:40:23] only db2133 was in core
[08:41:57] DBA, Operations, CAS-SSO, Patch-For-Review, User-jbond: Request new database for idp-test.wikimedia.org - https://phabricator.wikimedia.org/T256120 (Marostegui) Fixed
[08:42:21] if we get rid of multi-source hosts we could have a simpler schema
[08:42:38] did someone say nuke labsdb*? ;)
[08:42:58] but because multisource, the group is on the instance, and the section is on the replication table (section_instances)
[08:44:35] DBA, Patch-For-Review: Upgrade m1 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254556 (Marostegui) This is done. I am going to leave db1135 replicating for 24h (so we can also see if basic 10.4 -> 10.1 replication works) and then I will move db1135 somewhere else.
[08:44:42] DBA, Epic: Upgrade WMF database-and-backup-related hosts to buster - https://phabricator.wikimedia.org/T250666 (Marostegui)
[08:44:44] DBA, Patch-For-Review: Upgrade m1 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254556 (Marostegui) Open→Resolved
[08:46:42] I've run a couple of backups successfully already
[08:46:47] sweet
[08:47:48] also if etherpad worked on 10.4, it will work with anything :-D
[08:47:59] *10.4 will
[08:48:46] hahahaha
[08:48:47] yeah
[08:48:52] I thought the same
[09:10:21] while checking prometheus I saw metrics gathering is failing for db1077
[09:10:41] server seems to be up, should I take a look or is someone working on it/setting it up?
[09:11:16] it may just need a prometheus restart
[09:12:54] you can ignore it, it is the testing one
[09:13:04] probably grants missing or something
[09:13:05] ah, ok
[09:13:36] going back to my transfers and backups :-D
[10:49:15] DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui) Moved this to Tuesday 7th July at 05:00 AM UTC as I will be off the 1st of July, and I want to keep an eye after the switchover and the following days.
[11:34:01] DBA, Patch-For-Review: Switchover es5 master from es1023 to es1024 - https://phabricator.wikimedia.org/T255755 (Marostegui)
[11:34:23] DBA, Cloud-Services, CPT Initiatives (MCR Schema Migration), Core Platform Team Workboards (Clinic Duty Team), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (Marostegui)
[11:59:42] DBA: Relocate "old" s4 hosts - https://phabricator.wikimedia.org/T253217 (Marostegui)
[12:43:35] DBA, Cloud-Services, CPT Initiatives (MCR Schema Migration), Core Platform Team Workboards (Clinic Duty Team), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (Marostegui) s2 eqiad progress [] labsdb101...
[12:43:59] DBA, Cloud-Services, CPT Initiatives (MCR Schema Migration), Core Platform Team Workboards (Clinic Duty Team), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (Marostegui)
[13:36:47] DBA, DC-Ops, Operations, Sustainability (Incident Prevention): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (Marostegui) Can this task be closed? By default hosts reimage now but they do kee...
[14:03:30] DBA, DC-Ops, Operations, Sustainability (Incident Prevention): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (jcrespo) a:Kormat
[14:03:55] * kormat shakes his fist at jynus
[14:04:46] it is a decision-making assignment, not a doing assignment, eh!
[14:05:18] decide and then unassign if kept open
[14:06:08] I could also assign it to faidon, which I think was the person that requested it so it is sent to foundations
[14:06:12] up to you, really
[14:06:25] what?
[14:06:43] should T251416 be open?
[14:06:47] T251416: PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416
[14:06:55] not sure if a dba decision or a foundations decision
[14:07:01] I did not request that :)
[14:07:16] you requested me to open a ticket to understand the issue, but maybe I am wrong
[14:07:21] ah maybe!
[14:07:32] it rings a bell now
[14:07:38] I just want to make sure nobody is waiting on me
[14:07:54] I was envisioning it as something to be discussed across multiple SREs, not a task for me specifically to decide
[14:07:58] but as the dba issue has been solved, we can send it to your ream for long term
[14:08:09] although I'm happy to make that decision if no one else has any opinions on that :)
[14:08:10] s/for you/for your team/
[14:08:33] please talk to kormat, I am not really up to date with latest advances there, ok?
[14:08:41] ok
[14:08:47] hii
[14:08:50] hii :)
[14:09:05] so I assign it to him to mean "please don't want on me" if that makes sense?
[14:09:12] *wait
[14:10:07] paravoid: it would help if there was a foundations tag 0:-D
[14:11:12] there is SRE-tools, but that may or may not be the best fit here -- I'll defer to volans
[14:11:14] so I don't have ln -s foundations faidon :-D
[14:11:24] ;-)
[14:13:36] * volans reading backlog
[14:14:39] volans: tl;dr for the part I pinged you about: how can a partman task be tagged; is SRE-tools appropriate for that, and if not, do we have an alternative to offer (besides #operations)
[14:15:34] the long term solution is clearly SRE-tools for the PXE menu and all that work, although we know it will not happen right now
[14:15:46] yeah, that is known
[14:15:54] but maybe closing it was not the right action
[14:15:54] volans: is there a task for that?
[14:15:59] we can add SRE-tools and leave it in the backlog for that
[14:16:23] kormat: various
[14:16:26] so the change is, please kormat correct me, is that we have a way forward for dbs/backup hosts ?
[14:16:33] tracking is T116063
[14:16:34] T116063: Hardware Automation Workflow - Overall Tracking - https://phabricator.wikimedia.org/T116063
[14:16:38] and then a plethora of subtasks
[14:16:44] but technically the issue is still ongoing in general?
[14:16:50] jynus: correct
[14:16:58] note the date, it's 5y ago, before I started
[14:17:04] I don't know the details despite me writing the initial task
[14:17:12] ok, i'm going to update the ticket, removing dba, adding sre-tools
[14:17:17] (and unassigning myself :P)
[14:17:24] yep, all cool to me
[14:17:52] maybe let's add the one line summary as the last comment
[14:18:00] if the helps
[14:18:03] *that
[14:18:16] yes please
[14:18:20] yes. i've already written that bit
[14:18:46] let me find the operations task to add SRE-tools
[14:18:51] *project
[14:19:18] DBA, DC-Ops, Operations, Sustainability (Incident Prevention): PXE Boot defaults to automatically reimaging (normally destroying os and all filesystemdata) on all servers - https://phabricator.wikimedia.org/T251416 (Kormat) From the perspective of #dba, this issue is mostly resolved. Most DB mach...
[14:19:36] oh, it is there
[14:19:43] but let me add observability team
[14:20:45] let me know if it looks correct: https://phabricator.wikimedia.org/project/manage/1025/
[14:21:18] should I create a #backups project?
[14:21:29] there was one no?
[14:21:38] marostegui: maybe it didn't get backed up ;)
[14:21:49] I thought we had a backup tag or something
[14:21:51] lol
[14:21:52] nope
[14:21:57] At least we discussed it some time ago
[14:22:02] But don't remember what was the conclusion
[14:22:07] yeah, the issue is it could be misleading
[14:22:15] because "production sre backups"
[14:22:24] vs "I am backing up my tool project"
[14:22:36] plus having a separate project board?
[14:22:40] https://www.youtube.com/watch?v=MgxgYL5P4z4
[14:22:41] not something to discuss here
[14:23:05] marostegui: let's discuss on next meeting, ok?
[14:23:16] that ^ should be on the first page of the backup tag project :-)
[14:23:21] could make DBA a "data persistence" team tag
[14:23:27] and have yellow backup tag
[14:23:55] don't know
[14:54:15] DBA, Operations, SRE-tools, Patch-For-Review: Audit all cumin queries in switchdc scripts - https://phabricator.wikimedia.org/T243935 (Kormat)
[14:56:04] DBA: Create reuse recipes for tendril/zarcillo/dbprov/backup hosts - https://phabricator.wikimedia.org/T255768 (Kormat)
[15:19:20] marostegui: 1100 drifts. Mostly the MCR stuff
[15:19:39] I try to make it foldable so we ignore those for now
[17:24:12] marostegui: These are the drifts excluding MCR ones: https://phabricator.wikimedia.org/P11667
[17:24:25] (in total, around 100-ish)
[17:24:56] This is all of them: https://phabricator.wikimedia.org/P11668
[17:27:36] btw MCR schema changes caused around 10% size reduction in s6: https://grafana.wikimedia.org/d/000000377/host-overview?panelId=28&fullscreen&orgId=1&var-server=db1131&var-datasource=thanos&var-cluster=mysql&from=1592944447567&to=1593103141444
[17:27:48] in s1 and s8 it'll be massive
[17:53:08] DBA, CheckUser, Trust-and-Safety, WMF-Legal, and 2 others: Configure WMF wikis to log login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 (Huji) @DannyS712 when you get the chance, can I ask you to please review https://gerrit.wikimedia.org/r/605301/ ? I am going to follow up...