[05:37:52] 10DBA, 10Patch-For-Review: Drop ct_ indexes on change_tag - https://phabricator.wikimedia.org/T205913 (10Marostegui) [06:02:44] jynus: I am reviewing the wikis imported and they need a schema change [06:02:49] jynus: Can I run it? [06:06:24] yes! [06:06:35] great! [06:06:40] Doing it! [06:06:43] no [06:06:44] wait [06:06:46] No? [06:06:49] ok :) [06:06:54] we should let it replicate [06:07:04] it may be on the pending replication [06:07:18] then, you apply the schema [06:07:21] if not [06:07:28] but db1070 isn't delayed [06:07:31] makes sense? [06:07:33] (I checked on db1070) [06:07:35] it is [06:07:42] oh, you mean s5? [06:07:55] or the wikis from s3 [06:07:59] No, s3 - but you are right, I checked tendril, not the host itself [06:08:05] And tendril isn't dounig show slave 's3' status :) [06:08:09] the wikis are delayed [06:08:13] at the moment [06:08:19] so they are non-canonical [06:08:38] they have around 1.5 days of delay [06:09:15] db1070 is replicating from s3 codfw? [06:09:20] Or eqiad in the end? [06:09:28] nothing at the moment [06:09:33] except s5 codfw [06:09:48] Ah ok [06:09:52] I was waiting for the import to finish on the replicas [06:09:54] I will wait then yeah [06:10:06] which is 20-30 minutes [06:10:16] then will setup replication [06:10:23] and then we can do checks :-) [06:10:30] Yeah, I just saw the replicas are replicating shwiki [06:10:38] like, importing it [06:10:43] yeah, but for the import still [06:10:46] yep [06:18:51] while we wait the 3 GB left to import, I will start a backup of the 4 sections missing on eqiad [06:19:02] sounds good! [06:25:13] 10DBA, 10Operations, 10ops-codfw: rack/setup/install db2096 (x1 expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui) [06:25:54] 10DBA, 10Operations, 10ops-codfw: rack/setup/install db2096 (x1 expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui) 05Open>03stalled p:05Triage>03Normal Stalled as the server hasn't been received yet [06:26:16] 10DBA, 10Operations, 10ops-codfw: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui) [06:26:22] that should get rid of the missing backup alerts [06:26:38] and heat up the hosts for next week [06:34:53] 10DBA, 10Operations, 10ops-codfw: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui) [06:38:20] the replicas will catch up soon, preparing s3 replication [06:38:38] will set it up and will ask you to double check [06:42:37] s3 replication channel is on db1070 [06:42:47] will double check the position [06:48:53] yep, we are at the right position [06:49:45] will wait for you to do a thorough check of filters and everything in general before starting replication [06:50:01] (no hurry, I prefer to be slow with this) [06:51:29] hey [06:51:34] Sorry I was getting some breakfast [06:51:35] Let me check [06:52:00] checking filters [06:52:16] of course, I supposed so (I am going todo that myself now) [06:52:21] no hurry [06:53:07] this took 24 hours and could be broken in 1 second, take any time you want, please! [06:53:36] (I am reenabling innodb and binlog consistency) [06:54:12] Replicate_Wild_Do_Table: enwikivoyage.%,cebwiki.%,shwiki.%,srwiki.%,mgwiktionary.% that looks consistent with the new databases showing up on show databases; [06:54:22] It is also well-written a grep works for all of them [06:54:32] We do not want to do heartbeat [06:54:36] Confirmed? [06:54:59] I think it shouln't matter [06:55:11] Yeah, just asking because if we want to, we have to enable it too [06:55:14] but in the long run, we don't, assuming we replicate directly from codfw [06:56:03] it was not imported, so worse case scenario, we can enable it at a later time [06:56:08] sure [06:56:11] so the filters look good [06:56:15] I haven't checked the position [06:56:17] You want me too? [06:56:18] but I don't want garbage for now [06:56:26] I checked with the binlog [06:56:32] cool [06:56:34] of the master [06:56:37] based on ids [06:56:45] it should be ok [06:56:50] cool [06:56:53] let's go for it then? [06:57:00] so ok for start slave 's3'? [06:57:03] yep [06:57:35] wait one sec [06:57:47] What will happens with labs? [06:58:00] well, same than the import [06:58:09] Yep, the filter is there still [06:58:10] it will be replicated to dbstore1002 [06:58:10] good [06:58:13] but not on labs [06:58:22] if for some reason we reboot the servers, we need to remember to put it back [06:58:23] we have to change labs after caought up [06:58:25] (the filter I mean) [06:58:34] (stop servers in sync, etc.) [06:58:40] yeah [06:58:48] actually, it gets stored [06:58:53] I think [06:59:09] but I want to change it as soon as things caought up [06:59:11] We always have the same doubt haha [06:59:12] yeah [06:59:14] to the right place [06:59:15] make sense [06:59:23] let's go for db1070 's3' start slave then [06:59:25] that needs applying the filters [06:59:35] as they are "new" tables [06:59:59] (sanitarium ones on s5 are not redacted= [07:01:02] repl flowing now [07:01:11] checking for any breakage [07:01:17] so far so good [07:01:36] it shoudl catch up the 1-2 days of delay [07:01:44] and then we can check properly the schemas [07:01:53] and data [07:02:48] so we need to redact those new databases on sanitarium [07:02:58] yep [07:03:02] once everything has been switched over [07:03:05] and then switch filters [07:03:07] and caught up and all that [07:03:07] yeah [07:03:30] dbstore1002 in theory should be ok in the current state [07:03:44] although with the memory it has it may not finish replicationg ever [07:03:58] (for innodb) [07:07:38] Oh, I am going to run the check private data on the sanitarium host, to make sure it still works as expected [07:07:45] now that we for sure have no sanitized data [07:09:08] Looks like it is working as expected so far [07:15:16] dump.s6.2018-10-04--06-23-01 | failed [07:15:27] due to the test database, dropped [07:15:36] will try again after the others complete [07:18:56] note beliving we will have more potential problems until replication catchup and we do more stuff, I will be taking a break to my fasting now [07:19:02] *non [07:19:17] jynus enjoy the tostada! [07:28:17] after googling what's a tostada, I want one too! [07:28:25] (good morning) [07:28:51] XDDDD [07:49:41] 10DBA, 10Growth-Team, 10Operations, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10Marostegui) [08:18:24] jynus: once db1070 I will run the schema change. It needs to be run before we remove any filters or anything on labs, as labsdb hosts have the schema change already [08:19:35] Oh, it just went thru in db1070 \o/ [08:19:38] thru replication [08:20:54] 10DBA, 10Growth-Team, 10Operations, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10jcrespo) Apparently @Mattflaschen-WMF is no more in charge, who is in charge of flow maintenance now, maybe #gro... [08:21:22] ? [08:21:39] not sure if you mean that you did it or it was hidden on the replication already applied [08:21:49] No no, it was done thru replication automatically [08:22:01] that is why I asked to stop- I didn't know it [08:22:12] but there was a chance for replication breaking [08:22:26] :) [08:22:40] but it didn't fully caught up, right? [08:22:44] Yeah, when I asked I wasn't sure about the state of db1070 [08:23:00] jynus: not yet, only 3k seconds behind [08:23:05] so less than an hour! [08:23:05] I will do some data checks, can you do some schema checks? [08:23:11] yes, I am doing so :) [08:23:17] like a mysqldump -d to diff [08:23:20] cool, thanks [08:23:44] yeah, I am doing that with the latest touched tables [08:46:35] jynus: db1070 vs db1075 diff is clean for the wikis we moved. (clear for the tables we have touched in the last few months), the usual schema drifts are obviously there (tmp1 indexes and stuff like that) [08:53:06] I had to delay the data checks [08:53:12] will do now [08:55:07] (was sheparding the eqiad backup, full coverage now except s6, ongoing) [08:56:48] 4 minutes left of eqiad maintenance [08:57:05] hahaha [09:19:28] So for the network maintenance, let's do db1073 first and then db1072 for instance? [09:19:50] I mean the upgrades [09:19:54] socket and all that [09:20:16] 72 is m1? [09:20:28] m3 [09:20:34] db1073 m5 [09:20:36] db1072 m3 [09:20:48] mmm [09:21:26] not sure how to handle phab [09:21:48] should we ask to put it in read only mode and point to the replica? [09:22:29] it may switch automatically anyway [09:22:41] but on hard read only it just doesn't work [09:22:41] It is already discussed [09:22:44] oh [09:22:48] It will just be on read only [09:22:50] you pinged phab admins [09:22:50] And that's it [09:22:53] yep [09:23:06] but they will take care of changing it? [09:23:09] Yep [09:23:14] ok, then [09:23:15] They will coordinate with arzhel [09:23:17] I didn't read that [09:23:21] sorry [09:23:25] To do it right before the network change [09:23:31] No, it was discussed on an email thread :) [09:23:43] You are not on that thread, as I wanted to filter all the back and forth :) [09:23:44] then it doesn't matter, we can do one each [09:24:01] We will just need to reload the proxies once the host is back [09:24:08] and the network maintenance for it is finished [09:25:52] take the one you feel more confortable with and I will take the other [09:26:09] I don't mind [09:26:14] We have to do the same on both [09:26:19] stop mysql, full-upgrade and reboot [09:26:29] I saw you commented on the tmp socket task [09:26:41] But not sure what you wanted to mean [09:26:52] As far as I know there is nothing puppet-related with db1073 [09:27:03] we just need to stop mysql and it will start with the proper path for the socket, no? [09:27:23] marostegui: it was a self reminder [09:27:31] well, also for you if you needed it [09:27:35] no socket configuration [09:27:40] it is hardcoded with a link [09:27:44] yeah [09:27:48] so just needs a reboot [09:27:53] exactly [09:27:57] I was like: what does he mean? [09:27:57] I got confused when I went to edit it [09:27:59] hehe [09:28:04] and put there a reminder [09:28:08] :) [09:28:13] but I guess it was too cryptic [09:28:26] it was late and it was more of a todo [09:28:32] yeah, it was _pretty_ late! [09:52:08] 10DBA, 10Epic, 10Wikimedia-Incident: Improve regular production database backups handling - https://phabricator.wikimedia.org/T138562 (10jcrespo) [09:52:11] 10DBA, 10Patch-For-Review: Finish eqiad metadata database backup setup (s1-s8, x1) - https://phabricator.wikimedia.org/T201392 (10jcrespo) 05Open>03Resolved a:03jcrespo All green! https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=Backup+of Pending "only" snapshots, binlogs and increme... [10:05:17] 10DBA, 10Goal: Design and prepare infrastructure for database binary backups - https://phabricator.wikimedia.org/T206203 (10jcrespo) p:05Triage>03Normal [10:09:40] 10DBA: Research options for producing binary backups (lvm snapshots, cold backups, mariabackup) - https://phabricator.wikimedia.org/T206204 (10jcrespo) p:05Triage>03Normal [10:10:07] 10DBA: Research options for producing binary backups (lvm snapshots, cold backups, mariabackup) - https://phabricator.wikimedia.org/T206204 (10jcrespo) a:03jcrespo [10:10:34] ^goal tasks created [10:11:22] I will wait some days for other procurement tasks to advance and to talk to alex to see about hardware [10:25:36] nice! [10:25:37] Yeah [10:39:32] jynus: I was wondering what to do with https://phabricator.wikimedia.org/T138562 we have a few epic tasks for backsup, should we merge them in one? SHould we close that (which is basically kinda the roadmap we have in the document) and open tasks as we set the goals up every Q with individual tasks? [10:40:47] hey, I saw a ping here from 2 days ago for a CR, is still needed? [10:41:12] volans: it is nothing urgent, and you will be more busy [10:41:26] but I though you may have comments on banyeks work [10:41:41] feel free to add me to the CRs [10:41:44] wmf-pt-kill [10:41:57] I believe this is the patch: https://gerrit.wikimedia.org/r/#/c/operations/debs/wmf-pt-kill/+/463931/ [10:42:39] ack, I'll try to have a look later today [10:42:51] probably not that much that patch in particular [10:43:03] but a general comment about the repo [10:43:03] ohhau [10:43:05] ohhai [10:43:40] you are good on that, and may help him with comments [10:43:54] sure [10:44:03] (I say not that one, because moritz already provided useful feedback) [10:44:10] but nothing agains it of course [10:44:28] it is the non-debian parts that I thought of you [10:45:03] yeah for hte debian parts there are multiple people with much more wisdom than me ;) [10:45:38] that is what I meant [10:46:25] marostegui: we can maybe leave the improve as epic [10:46:35] and detail it with the other ones? [10:46:41] in theory, the improve is an epic [10:46:49] and the one I just created isa Meta [10:46:55] tracking for other smaller ones [10:47:04] but I didn't know how to express that [10:47:28] But the improve is basically what we have in the backups doc, no? [10:47:29] there will be a procurement task and another [10:47:32] (kinda of) [10:47:34] yes [10:47:39] or it should be [10:47:51] but the new meta task is he 3-months goal [10:48:02] a smaller scope but still nto to be worked directly [10:48:06] not sure if that is clear [10:48:09] it is #goal [10:48:25] mark ask for those meta tickets (which for me is also a good idea) [10:48:32] Yep, saw it. My point is, should we keep the Epic one? [10:48:45] ok, I thougt you wanted to delete the goal one [10:48:55] Nooo, the goal one (the meta) I like those [10:49:09] My point is about the epic, which is what we have in the document, but with not many details [10:49:13] so you want to delete the epic one because it is too large? [10:49:17] So not sure if it is useful [10:49:25] Or just noise [10:49:31] well, it has a point [10:49:52] it is a tracker of other tasks, and we don't have a #backups tag [10:50:06] if we create or ask to create a #backups one, we can delete it [10:50:29] do you see what I mean, it tries to organize a lot of smaller pending tasks not part of this goal [10:50:49] Yeah, lets get a backup tag, that is useful I think and important enough to have one [10:50:57] we can either create a tag or convert it into tracking [10:51:24] but I don't want to lose the 20 subtickets :-) [10:51:30] no no [10:51:39] I will ask to get a new tag [10:51:47] It is an important project [10:51:48] they convert the noize you are right to point into an understandable signal :-) [10:52:02] so that was the only point of that task [10:52:15] being the parent of those, but we can create a tak instead [10:52:19] *tag [10:52:37] however, I would coordinate with other people [10:52:44] because #backups is generic [10:52:54] and that task technically is #database-backups [10:53:00] yep, I was thinking about database-backups [10:53:01] exactly [10:53:15] but maybe it makes no sense to have a small one? [10:53:32] Why not? Just a tag and a dashboard for it [10:53:53] so very open to suggestions, just I didn't have the patience to think about all the details and the request [10:54:11] e.g. is it a group-tag or a topic tag, etc. [10:54:28] should it be used for people that creates their own database tags? [10:54:32] *backups [10:54:58] I don't know, I will create a ticket and they can help us, they probably have more ideas [10:55:03] sure [10:55:41] so my plan is to stop s3 and the s3 replication on s5 [10:56:01] I rush out now for a quick lunch [10:56:02] wel,, stop replication on s3 master [10:56:20] and move the s5 master s3 conenction from codfw [10:56:32] while downtiming all s3 eqiad alerts [10:56:46] is that ok, does it affect other maintenance? [10:57:08] so you want to: stop s3 eqiad, move db1070:s3 channel to replicate from s3:codfw [10:57:12] right? [10:57:16] yes [10:57:22] do we have to sanitize stuff first? [10:57:30] oh, that is missing? [10:57:34] I haven't done it [10:57:41] it can be done later, I think [10:57:46] I think it is not coupled [10:57:57] I can do it too [10:58:14] trying to think if there are any implications of doing it before or after [10:58:16] the dependency [10:58:19] is on renaming the tables [10:58:30] and setting a filter on s3 [10:58:36] which I am not going to do yet [10:58:54] and moving the filter on labs [10:58:55] yeah, if you change it from s3 eqiad to s3 codfw nothing should change regarding data/replication [10:58:58] it can be done later indeed [10:59:12] for now I wanted to to the dangerous step (affecting prod) [10:59:29] then prepare the filters [10:59:39] both on prod and on labs [10:59:44] and finally deploy the config change [10:59:46] yeah [10:59:57] you want to do that now or after lunch? [11:00:19] we can do it after lunch, but note I have a meeting at 14:30 CEST [11:00:40] we can do it now if you like [11:00:50] that is up to you [11:00:58] I don't mind having lunch at 2pm :) [11:01:02] it 15:30 ok? [11:01:05] sure [11:01:10] the network maintenance is at 4? [11:01:10] then then [11:01:13] aaah no 4utc! [11:01:15] right! [11:01:15] yeah [11:01:19] 15:30 is good [11:01:23] think of the fine details [11:01:27] I will also think [11:01:30] and we can compare notes [11:01:36] sounds good [11:01:59] I will get some lunch now then [11:03:59] b*nyek|afk: the package seems good to me, but I would wait for v*lans feedback, he always has very good feedback to provide that may save you a lot of time in the long term [11:07:48] if you get bored (hopefully not), start thing about doing the same treatment for our patched version of pt-heartbeat-wikimedia [11:08:01] *thinking [12:26:39] ok, I'll wait [12:31:24] jynus banyek check our etherpad on line 6 till 13, I have added next steps for the wikis movement [12:31:36] I am still thinking about them if we are missing something [12:51:36] jynus: I start a few other things before that, but where's the patched version we use? I mean, anywhere we use it, I can find that? [12:53:10] banyek: on any master :-) [12:53:36] And on modules/mariadb/files/pt-heartbeat-wikimedia [12:54:00] :) That was what I was hoped for [13:31:33] around [13:46:33] * banyek afk for like ~15 mins [14:01:21] I see you are making changes to the "plan" [14:01:24] good [14:01:45] see what you think [14:01:56] I am proposing to stop replication everywhere [14:02:01] yeah, fine with that [14:02:03] it is safer [14:02:16] but the sanitization could be done in advance [14:02:28] yeah, it doesn't matter [14:02:56] I've also scheduled the patch for deployment on Deployments [14:03:05] so it is merged today [14:03:29] 10DBA, 10Growth-Team, 10Operations, 10StructuredDiscussions, 10WorkType-Maintenance: Setup separate logical External Store for Flow in production - https://phabricator.wikimedia.org/T107610 (10kostajh) @jcrespo yes, #growth-team is handling #structureddiscussions. > Not doing this may soon block T106386... [14:03:34] Ah, I see it [14:03:54] I think I didn't really change your plan, just rearranged it [14:04:11] correct :-) [14:04:14] We are in an agreement! [14:04:41] so do you have time now? [14:04:46] I don't get line 19 [14:04:50] But we can discuss later [14:04:53] Let's do it now! [14:05:25] 19-on are to be done the day before the switch [14:05:33] not on codfw though [14:05:34] stop sending updates to s3 [14:05:39] with a filter [14:05:51] but not now [14:05:58] yep [14:06:02] let's do the other steps then [14:06:04] and rename the tables in advance [14:07:04] who runs the redact, you or me? [14:07:08] 10DBA, 10User-Banyek: BBU problems dbstore2002 - https://phabricator.wikimedia.org/T205257 (10Papaul) @Marostegui yes next Thursday works for me. [14:07:09] I ca do it [14:07:17] can [14:07:18] let me check it [14:07:25] we need to run it on eqiad only? [14:07:30] only on db1124 [14:07:32] as they won't be on codfw [14:07:35] yep [14:07:47] I will check it has ran [14:07:48] so we need to run it on db1124:3315 [14:07:51] ok [14:07:59] do you need the db list? [14:08:05] enwikivoyage cebwiki shwiki srwiki mgwiktionary [14:08:07] right? [14:08:18] yep 5 in total [14:08:37] I will check the user table to see if it gets sanitized and then I will log with my user to see if my "new" user gets sanitized too [14:09:06] I was going to fo that, but I guess I can double check [14:09:09] yes [14:09:12] let's both do that [14:09:54] tell me when ran [14:10:01] about to hit it [14:10:17] back [14:10:21] meanwhile, I will start downtiming lag on may servers [14:10:24]