[04:59:56] 10DBA, 10Patch-For-Review: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462 (10Marostegui)
[05:02:37] 10DBA, 10Patch-For-Review: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462 (10Marostegui) 05Open→03Stalled This is all done except the master, which will be done once the DC switchover is done and eqiad is standby.
[05:19:58] 10DBA, 10User-Urbanecm: Establish process of determining shard for new wikis - https://phabricator.wikimedia.org/T259438 (10Marostegui) I would prefer if we updated the docs and the phab template to point it to s5 (once we are fully ready for it) and even send an email to wikitech-l to make sure everyone knows...
[05:22:38] 10DBA, 10Patch-For-Review: Update shard descriptions in db-eqiad/db-codfw - https://phabricator.wikimedia.org/T259437 (10Marostegui) p:05Triage→03Medium a:03Marostegui
[05:22:44] 10DBA, 10User-Urbanecm: Establish process of determining shard for new wikis - https://phabricator.wikimedia.org/T259438 (10Marostegui) p:05Triage→03Medium
[05:23:45] 10DBA, 10Data-Services, 10cloud-services-team (Kanban): Prepare and check storage layer for lldwiki - https://phabricator.wikimedia.org/T259436 (10Marostegui) p:05Triage→03Medium Let's go for s5!
[05:24:45] 10DBA, 10Patch-For-Review, 10User-Urbanecm: Move muswiki and mhwiktionary (closed wikis) from s3 to s5 - https://phabricator.wikimedia.org/T259004 (10Marostegui) >>! In T259004#6353151, @Urbanecm wrote: > @Marostegui Hello, please note my availability will be limited during August 4-6. Since your vacation en...
[07:44:23] 10DBA: Code and production differs on s3 on imagelinks table - https://phabricator.wikimedia.org/T259232 (10Marostegui) List of affected wikis: ` acewiki arbcom_fiwiki arwikimedia arwikiversity aswikisource bdwikimedia bewikimedia bewikisource bjnwiki boardgovcomwiki brwikimedia brwikisource checkuserwiki ckbwik...
[07:56:16] 10DBA: Code and production differs on s3 on imagelinks table - https://phabricator.wikimedia.org/T259232 (10Marostegui)
[08:05:31] 10DBA: Code and production differs on s3 on imagelinks table - https://phabricator.wikimedia.org/T259232 (10Marostegui) s3 eqiad [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1004 [] db1124 [] db1123 [] db1112 [x] db1095 [x] db1078 [x] db1075
[08:23:50] 10DBA, 10Patch-For-Review: Update shard descriptions in db-eqiad/db-codfw - https://phabricator.wikimedia.org/T259437 (10Marostegui) 05Open→03Resolved Done with https://gerrit.wikimedia.org/r/617876 We can update it once s5 is the default too
[08:36:36] 10DBA: Code and production differs on s3 on imagelinks table - https://phabricator.wikimedia.org/T259232 (10Marostegui)
[08:36:44] 10DBA: Code and production differs on s3 on imagelinks table - https://phabricator.wikimedia.org/T259232 (10Marostegui) 05Open→03Resolved All done
[08:38:39] I am checking https://logstash.wikimedia.org/goto/aa4e6616bb0a1dfed5fd2b09faf257cd
[08:38:47] But so far I cannot find anything wrong
[08:38:53] what did you break this time? ;)
[08:39:06] Maybe the usual mw false positives?
[08:39:39] I did a weight change around that time https://phabricator.wikimedia.org/P12139
[08:39:42] but not a super big thing
[08:40:04] Also nothing on https://grafana.wikimedia.org/d/000000278/mysql-aggregated?panelId=6&fullscreen&orgId=1&var-site=eqiad&var-group=core&var-shard=s1&var-role=All
[08:40:24] There are no fatals or anything
[08:43:49] marostegui: does mw use heartbeat for detecting lag?
if so, that might not show up on our dashboards
[08:44:40] yes, it does for sX, but not for es (just for the record)
[08:44:46] kormat: But I have checked pt-heartbeat too
[08:46:26] the server with the most errors is db1119, but I cannot see it being broken in any way
[08:46:48] Maybe I can just depool it
[08:46:52] And see what happens
[08:47:02] But from what I can see it is performing fine
[08:48:00] Ah no, most of the errors from that one are from hours ago
[08:48:25] https://logstash.wikimedia.org/goto/76c904e3c429b8bb48184b987969507d
[08:50:37] Nothing wrong on https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1 either
[08:53:40] The only thing I can think of is that I stopped the s7 codfw master at around that same time for MCR and MW is going nuts... but that doesn't match, as it reports errors on enwiki and not on s7, although s7 has centralauth
[08:54:02] that could explain why it is not really affecting anyone
[08:54:18] It matches the time almost perfectly https://grafana.wikimedia.org/d/000000273/mysql?panelId=2&fullscreen&orgId=1&from=now-6h&to=now&var-server=db2118&var-port=9104
[08:54:51] 10DBA, 10Patch-For-Review, 10User-Urbanecm: Move muswiki and mhwiktionary (closed wikis) from s3 to s5 - https://phabricator.wikimedia.org/T259004 (10Urbanecm) Sounds good to me @Marostegui :).
[08:55:18] 10DBA, 10Patch-For-Review, 10User-Urbanecm: Move muswiki and mhwiktionary (closed wikis) from s3 to s5 - https://phabricator.wikimedia.org/T259004 (10Marostegui) Excellent, thank you. I will block that window on the deployments page
[08:56:21] FYI, for the db1082 not paging in VO issue I've opened https://phabricator.wikimedia.org/T259465
[08:56:31] thanks, subscribed!
[08:57:10] np! personally not sure yet on the proper solution, everything was working as intended
[08:57:20] anyways!
[08:58:05] godog: is there a way to send a reminder? like: this incident is still open...
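The pt-heartbeat check mentioned above can be reproduced by hand. A rough sketch, assuming the stock pt-heartbeat schema (a `heartbeat.heartbeat` table with `server_id` and `ts` columns); the hostname is illustrative and this only makes sense against a live replica:

```shell
# Peek at the freshest heartbeat row on a replica; replication lag is
# roughly "now minus ts". The hostname is hypothetical, and the column
# layout follows the default pt-heartbeat --create-table schema.
mysql -h db1119.example.org -e \
  "SELECT server_id, ts FROM heartbeat.heartbeat ORDER BY ts DESC LIMIT 1"
```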
[08:59:56] good question marostegui, I don't know
[09:00:16] we'll find out though! could you update the task too with the same question?
[09:05:09] sure!
[09:10:10] thanks!
[10:45:41] 10DBA: Code and production differs on s3 on templatelinks table - https://phabricator.wikimedia.org/T259241 (10Marostegui)
[10:45:51] mmm. i see what looks like a bug in profile::mariadb::replication_lag
[10:46:02] that might explain why i can't find the alert in icinga
[10:50:02] yeaah, that's it.
[10:58:05] 10DBA: Code and production differs on s3 on templatelinks table - https://phabricator.wikimedia.org/T259241 (10Marostegui) eqiad progress [] labsdb1012 [] labsdb1011 [] labsdb1010 [] labsdb1009 [x] dbstore1004 [] db1124 [] db1123 [] db1112 [x] db1095 [x] db1078 [x] db1075
[11:22:28] 10DBA: Code and production differs on s3 on templatelinks table - https://phabricator.wikimedia.org/T259241 (10Marostegui)
[11:22:41] 10DBA: Code and production differs on s3 on templatelinks table - https://phabricator.wikimedia.org/T259241 (10Marostegui) 05Open→03Resolved All done
[11:49:54] 10DBA: Code and production differs on s3 on templatelinks table - https://phabricator.wikimedia.org/T259241 (10Marostegui) For the record, the affected wikis were: ` acewiki arbcom_fiwiki arwikimedia arwikiversity aswikisource bdwikimedia bewikimedia bewikisource bjnwiki boardgovcomwiki brwikimedia brwikisource...
[11:53:52] 10DBA: Code and production differs on s3 on pagelinks table - https://phabricator.wikimedia.org/T259238 (10Marostegui) Affected wikis ` acewiki arbcom_fiwiki arwikimedia arwikiversity aswikisource bdwikimedia bewikimedia bewikisource bjnwiki boardgovcomwiki brwikimedia brwikisource checkuserwiki ckbwiki cowikime...
[12:13:42] 10DBA, 10Patch-For-Review, 10User-Urbanecm: Move muswiki and mhwiktionary (closed wikis) from s3 to s5 - https://phabricator.wikimedia.org/T259004 (10Marostegui) @Urbanecm would you have time to prepare MW patches before the maintenance day so they can be reviewed by @Ladsgroup and myself? Thanks for your su...
[12:17:22] nice, revision on eswiki went from 50GB to 14GB after MCR
[12:19:27] marostegui, jynus: wmfmariadbpy is installed on cumin2001 now. can you test to see if mysql.py works for you? thanks
[12:19:35] i've tried it against a few hosts successfully
[12:19:36] shure
[12:19:37] sure
[12:20:22] `db-replication-tree` works too
[12:22:39] `db-switchover` does _not_ work
[12:29:33] kormat: there are some hosts showing up as lagging with the lag alert
[12:29:36] the new one I mean
[12:30:06] Ah right, those are s7, I downtimed them, but that alert was added after I downtimed them
[12:31:30] ah hah
[12:31:40] do we need to do anything about them?
[12:31:44] nah
[12:31:47] just downtimed them
[12:31:52] +1
[12:31:55] those hosts are indeed lagging as their master is stopped
[12:32:04] they're lazy
[12:36:04] ok, unless anyone objects, i'm going to proceed with the rollout of wmfmariadbpy
[12:36:13] i'll follow up with the db-switchover issue after
[12:36:19] kormat: did you find why switchover wasn't working?
[12:36:26] marostegui: yeah import failure
[12:36:42] ok
[12:36:51] the executable expects to be in the same dir as the CuminExecution library
[12:36:56] so it's a simple fix
[12:36:59] if we need to do an emergency failover we can just use the git repo one, right?
[12:37:04] yep, exactly
[12:37:09] good
[12:37:11] mmm
[12:37:12] one sec
[12:37:18] and in any case i should have the fix deployed this afternoon
[12:37:22] can you wait a few minutes before removing mysql.py from cumin1001?
[12:37:30] sure
[12:37:37] I am finishing an alter table, using it, it shouldn't cause anything, but let's not mess with it
[12:37:43] what i've done on cumin2001 is chmod -x /usr/local/bin/mysql.py for the moment
[12:37:45] +1
[12:39:00] I think it will be done in around 10 minutes or so
[12:40:09] ACKNOWLEDGEMENT - 5-minute average replication lag is over 2s on db2095 is CRITICAL: 2512 ge 2 Marostegui known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2095&var-port=13317&var-dc=codfw+prometheus/ops
[13:08:53] kormat: schema change finished
[13:10:48] great
[13:11:09] alright, doing cumin1001 now
[13:11:52] ok!
[13:12:30] mysql.py and db-replication-tree both work
[13:12:49] alright, deployment done
[13:14:10] 10DBA: Set up and package wmfmariadbpy helper scripts so they can easily be deployed to all database server and client hosts - https://phabricator.wikimedia.org/T165358 (10Kormat)
[13:14:57] you mention switchover not working?
[13:15:00] *mentioned
[13:15:33] jynus: yes. i'll be sending a CR to fix that shortly.
[13:15:51] ok, I wanted to help, but if you have it controlled, no issue
[13:16:03] I have restarted slave on db2118, we'll see if logstash goes back to normal once it has caught up
[13:16:12] and if it does...it makes no sense and I will create a ticket for CPT
[13:16:21] jynus: i'll send the CR to you :)
[13:16:24] we can do a stop in sync on codfw
[13:16:32] and some other stuff
[13:37:26] 10DBA, 10Cloud-Services, 10MW-1.35-notes (1.35.0-wmf.36; 2020-06-09), 10Platform Team Initiatives (MCR Schema Migration), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui)
[13:40:28] hello, batm!
[13:41:29] batm!!
[13:50:56] so s7 slave being stopped was the cause...ticket incoming!
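The db-switchover import failure described above — an entry point that only works when it sits next to its CuminExecution library — can be mimicked in miniature. Everything below is hypothetical and runs in a throwaway temp directory; it just shows the failure mode, not the actual wmfmariadbpy code:

```shell
# Mimic an entry point that only finds its "library" when both live in
# the same directory (all names here are made up for illustration).
workdir=$(mktemp -d)
printf 'echo from-lib\n' > "$workdir/cumin_lib.sh"
cat > "$workdir/db-switchover" <<'EOF'
#!/bin/sh
# Naive entry point: sources the library from its own directory only.
. "$(dirname "$0")/cumin_lib.sh"
EOF
chmod +x "$workdir/db-switchover"

"$workdir/db-switchover"     # works: the library sits beside the script

mkdir "$workdir/deployed"
cp "$workdir/db-switchover" "$workdir/deployed/"
"$workdir/deployed/db-switchover" 2>/dev/null \
  || echo "import failure"   # breaks once the script is installed alone
```

Packaging the script into a different directory than its library reproduces exactly this kind of breakage, which is why deployment layout needs its own fix (the T259516 proposal above).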
[13:51:38] ah no, maybe not, logstash starting to draw again
[13:51:45] I will create a task anyways
[13:59:07] jynus: o/ I have a question about our backup chat of last week, if you have 5 mins (otherwise even tomorrow, no rush)
[14:00:17] give me one second, I have another ongoing chat
[14:01:08] oh yes np
[14:16:22] kormat: see why it is confusing? "which mysql.py -> /usr/bin/mysql.py" "mysql.py -h db1111 sys -> bash: /usr/local/sbin/mysql.py: No such file or directory"
[14:16:33] I don't like that bash functionality
[14:18:00] heh
[14:19:25] checkhash option - If set, bash checks that a command found in the hash table exists before trying to execute it. If a hashed command no longer exists, a normal path search is performed.
[14:20:35] Re: transferpy I don't know what to do
[14:20:47] it feels wrong to make this depend on transferpy
[14:20:56] i agree. give me a minute, i'll write up something.
[14:21:00] for just a stupid wrapper
[14:21:16] but duplicating it is also worse
[14:25:14] 10DBA, 10Operations, 10User-Kormat: DBA python layout - https://phabricator.wikimedia.org/T259516 (10Kormat)
[14:25:22] 10DBA, 10Operations, 10User-Kormat: DBA python layout - https://phabricator.wikimedia.org/T259516 (10Kormat) p:05Triage→03Medium
[14:25:24] jynus: ^ that's my proposal
[14:26:34] looks good, although we should take the RemoteExecution version from transferpy
[14:26:45] sure
[14:26:54] I think it looked healthier and more complete, let me check
[14:27:57] https://github.com/wikimedia/operations-software-transferpy/tree/master/transferpy/RemoteExecution
[14:28:39] https://github.com/wikimedia/operations-software-transferpy/blob/master/transferpy/test/unit/test_CuminExecution.py
[14:28:59] at least 45 lines healthier :-)
[14:29:45] we can also sync that work with a new transferpy release
[14:36:10] elukey: shoot here or on pm
[14:38:19] jynus: sure! So I was trying to imagine a failure scenario for my use case, and the related steps to recover data etc..
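The `which`-vs-`bash` discrepancy discussed above comes from the shell's command hash table (the `checkhash` option quoted from the bash manual). It can be demonstrated safely in a throwaway directory; `mytool` and the paths below are made up for the demo and nothing touches the real system:

```shell
# Demonstrate the shell command-hash behaviour behind the confusing
# "which says one path, the shell runs another" output above.
workdir=$(mktemp -d)
printf '#!/bin/sh\necho hello-v1\n' > "$workdir/mytool"
chmod +x "$workdir/mytool"
PATH="$workdir:$PATH"

mytool          # first run: the shell remembers where it found mytool
hash -r         # flush the hash table, as one would after moving a binary
mytool          # found again via a normal PATH search
```

After a binary moves (as mysql.py did from /usr/bin to /usr/local/sbin), `hash -r` is the standard way to force the fresh path search that `checkhash` would have done automatically.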
(while writing the docs). The doubt that I have currently is about restoring a known good state of a database on an-coord1001 (where mariadb handles multiple dbs in the same instance)
[14:38:55] from the binlog, since IIUC it would involve the state of all databases
[14:39:27] it is of course a problem that we (as analytics) created by having a single instance with multiple dbs
[14:39:34] so you mean that e.g. the analytics_1 db breaks and the others are still ok?
[14:39:40] exactly
[14:40:07] say that an upgrade goes wrong, an app using the db misbehaves, etc..
[14:40:08] and there is analytics_2 on the same process (not only on the same server)
[14:40:28] yep
[14:40:31] first, if that is likely, having multiple instances is nicer
[14:40:34] up to a scale
[14:40:50] e.g. we cannot have 900 instances on the same server, that is very inefficient
[14:41:05] but we can have, no problem, e.g. 5
[14:41:24] but even if that is the case, the issue is never whether you can recover stuff
[14:41:28] it is how fast you can do it
[14:41:42] with mysqldump, you used to be able to recover only the full instance
[14:42:00] now with mydumper, you can recover only a single db, table or even rows of a table
[14:42:07] you agree with that, right?
[14:44:12] elukey, still there?
[14:45:11] jynus: I am sorry
[14:45:25] if busy we can continue at a later time
[14:45:44] nono I am good, I got sidetracked by a task
[14:45:57] so please read and agree (or not) first
[14:46:25] I agree yes, I was about to follow up saying that the weekly backup might not be enough for my use case
[14:47:01] yeah, and that is, as I mentioned before, where the binlog takes effect
[14:47:02] assuming using mydumper as indicated in the docs
[14:47:24] and we can apply the binlog for only 1 db, if that is what is necessary
[14:48:08] ok so it is possible, if one wants, to recover the state of a single db from the binlog?
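The mysqldump-vs-mydumper distinction above comes down to output layout: mydumper writes one file per table, which is what makes selective restores possible. A sketch with hypothetical host, user and database names (exact flags vary between mydumper/myloader versions, so check `--help` on the installed one):

```shell
# Dump the whole instance into per-table files.
mydumper --host an-coord1001.example --user dump --ask-password \
         --outputdir /srv/backups/an-coord1001-latest

# Restore a single database from that dump, leaving every other
# database on the instance untouched.
myloader --host an-coord1001.example --user restore --ask-password \
         --directory /srv/backups/an-coord1001-latest \
         --source-db analytics_1
```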
(leaving the others unchanged)
[14:48:28] so I think the issue is that I didn't properly explain what the binlog is
[14:48:39] it is just a registry of all the changes applied
[14:48:54] so it can be filtered on apply
[14:49:11] and apply only some of the changes - filter by db, table, time, etc
[14:49:14] okok this is what I wanted to know, I haven't done it before
[14:49:34] I really suggest you try it on a vm or otherwise outside of production
[14:49:40] if you do it, it really clicks
[14:49:52] it is not difficult, but it takes some practice
[14:50:00] * elukey nods
[14:50:04] e.g. simulating a DROP
[14:50:24] and recovering a table up to just before that
[14:51:06] if speed of recovery is very important, as I said, we can add additional daily snapshots
[14:51:15] but that requires extra resources
[14:51:57] nono I just need a path to recovery that is known to analytics admins if we are in trouble (and you folks are not around etc..)
[14:51:59] we only do it for mw metadata databases
[14:52:33] I'll do some tests with the binlog, dropping a table and recovering seems to be a good use case
[14:52:36] we could maybe do a simulation with analytics admins and kormat
[14:52:43] if that helps
[14:53:10] it would yes
[14:53:18] my biggest fear is "of course I understood" and then realizing too late it takes some reading
[14:53:50] it is true that full recovery is what is normally more automated
[14:54:02] other things being more manual, that will be fixed with time
[14:54:29] I will talk to my team to see if there is interest, talk to yours
[14:54:35] and we can also propose it to other sres
[14:54:45] for the DROP table recovery use case, if I want to use the binlog, do I need to start from a known backup and then re-apply statements up to a certain point?
(just to understand the overall procedure)
[14:54:58] correct
[14:55:02] ah okok now I got it
[14:55:39] but the backups need the coordinates
[14:56:18] that is why we gather them with both centralized dumps and snapshots
[14:56:50] also because dumps are slower to recover, it normally doesn't make sense to have a lot of them
[14:58:09] ACKNOWLEDGEMENT - 5-minute average replication lag is over 2s on db2077 is CRITICAL: 6474 ge 2 Marostegui known https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2077&var-port=9104&var-dc=codfw+prometheus/ops
[14:59:45] super, I have some reading to do
[15:00:04] thanks for the brainbounce, I'll probably ask more n00b questions :)
[15:00:15] I love people asking questions
[15:00:22] not many people have the curiosity
[15:01:37] Manuel thought the same about my questions, after a while he added a > /dev/null redirect on IRC when I ping him
[15:01:49] I always need to find new things/words to bypass his filters
[15:01:53] :D
[15:02:06] :D
[15:03:38] no, but it is true, I am trying to get more people to know the procedures for emergency recovery
[15:07:55] yep yep, jokes aside this is really great, we should avoid the "I'll read the procedure when the time comes"
[15:08:39] awww, schema drifts biting me again
[15:09:17] how so?
[15:10:38] jynus: ^
[15:10:43] oops
[15:10:43] 10DBA, 10Cloud-Services, 10MW-1.35-notes (1.35.0-wmf.36; 2020-06-09), 10Platform Team Initiatives (MCR Schema Migration), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (10Marostegui) Today while altering frwikti...
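The procedure confirmed above — restore a dump taken with known binlog coordinates, then replay only one database's changes up to just before the bad event — might look roughly like this. The file names, start position, stop time and db name are all hypothetical; this is the kind of thing to rehearse on a VM first, as suggested:

```shell
# 1. Restore the last good dump of analytics_1 (taken with known binlog
#    coordinates, say binlog.000123:456789).

# 2. Replay later changes for that one database only, stopping just
#    before the DROP. --stop-position works too if the exact offset is
#    known. Note: --database filtering is only reliable for events
#    logged under that database (e.g. USE-scoped statements).
mysqlbinlog --database=analytics_1 \
            --start-position=456789 \
            --stop-datetime='2020-08-03 14:00:00' \
            binlog.000123 binlog.000124 \
  | mysql -h an-coord1001.example
```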
[15:10:45] jynus: ^
[15:28:19] 10DBA: Review revision table and make sure that the PK is always rev_id - https://phabricator.wikimedia.org/T259524 (10Marostegui)
[15:36:48] 10DBA: Review revision table and make sure that the PK is always rev_id - https://phabricator.wikimedia.org/T259524 (10Marostegui)
[15:42:07] 10DBA: Review revision table and make sure that the PK is always rev_id - https://phabricator.wikimedia.org/T259524 (10Marostegui)
[16:03:32] 10DBA, 10Patch-For-Review, 10User-Urbanecm: Move muswiki and mhwiktionary (closed wikis) from s3 to s5 - https://phabricator.wikimedia.org/T259004 (10Urbanecm) Sure! So, as a mental recap, we will need a readonly patch for both muswiki and mhwiktionary, a change of db-eqiad/db-codfw files, and a revert of th...
[16:09:30] 10DBA, 10Patch-For-Review, 10User-Urbanecm: Move muswiki and mhwiktionary (closed wikis) from s3 to s5 - https://phabricator.wikimedia.org/T259004 (10Marostegui) And also the yaml that will regenerate the new s5 dblist
[16:09:49] heh, thanks marostegui !
[16:09:52] going to fix that too...
[20:05:47] 10DBA: Review revision table and make sure that the PK is always rev_id - https://phabricator.wikimedia.org/T259524 (10Gilles) @Marostegui the timing of this issue seems to match {T259520} Do you think this was the cause?
[20:09:08] 10DBA: Review revision table and make sure that the PK is always rev_id - https://phabricator.wikimedia.org/T259524 (10Marostegui) I doubt it, as those were just codfw replicas and only in s7, which is not enwiki.