[00:50:35] FIRING: MysqlReplicationThreadCountTooLow: MySQL instance pc1013:9104 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=pc1013&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[04:50:35] FIRING: MysqlReplicationThreadCountTooLow: MySQL instance pc1013:9104 has replication issues. - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=pc1013&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationThreadCountTooLow
[05:04:01] I've downtimed it
[07:14:35] ta
[10:09:43] jynus: could I ask you to look at T375448 please? The object in question is in the backups; I think you prefer not to simply restore the backup object back into the relevant swift containers, but rather put it somewhere and leave a commons admin to re-upload?
[10:09:44] T375448: For some commonswiki pages, the imageinfo URL returns file not found - https://phabricator.wikimedia.org/T375448
[10:11:27] Emperor: on a meeting, will read and get back to you soon
[10:11:33] TY :)
[10:11:50] usually my preference is not to overwrite unless it is a clear backend error with a simpler fix
[10:11:57] and to reupload it
[10:12:09] but will get back to you soon
[10:14:30] 👍
[10:45:47] Emperor: I am going to comment with the same thing that I said: if you have the file handy, just upload it to phab and ask if they want to reupload it, or I can do it if you just checked the DBs but didn't download it
[10:46:39] otherwise we can do it for them, but as a new upload. I like that as it is more traceable, and to end users there is no difference (they care about the image title, not so much the upload)
[10:48:05] jynus: I've not downloaded it, no (just checked that both DCs' backups thought they had it)
[10:53:22] I will do it myself then
[10:53:25] no worries
[10:53:52] that's why I already answered directly with the solution
[10:54:27] we should talk, however, about a solution for mass recovery
[10:55:04] as there could be a use case where huge collections need to be recovered, and we should come up with an automation for that
[10:55:12] (for the future)
[10:56:21] let me also double-check the hash and the permissions of the file
[10:58:13] yeah, confirming it is a public file and that the hash is the same: https://commons.wikimedia.org/w/index.php?title=File:103rd_Street_Station,_New_York_(9679771161).jpg&action=info
[11:02:25] it is too large for phabricator
[11:02:27] :-(
[11:02:36] will upload it to people.wm
[11:03:02] Thanks!
[11:03:12] if only we had a proper file storage service for these things! (jk)
[11:07:58] that would be an argument in favour of just shoving it straight back into swift ;-)
[11:08:05] yeah
[11:08:13] but it is common courtesy too
[11:08:24] we let the users handle it (that's my take, ofc)
[11:10:03] https://phabricator.wikimedia.org/T375448#10170718
[11:13:49] hey, looking on the bright side, another file lost, another file successfully recovered from backups CC kwakuofori
[11:16:56] 99.9% media backup coverage is "kenough"!
https://c.tenor.com/sB094nVU5TcAAAAC/tenor.gif
[11:36:38] I am having some weird desktop decorator glitches, I am going to restart my desktop
[11:47:54] jynus: \o/
[12:16:00] arnaudb: zarcillo was wrong, it still marked pc1015 as a replica
[12:17:31] UPDATE section_instances SET section='pc3' WHERE instance='pc1015';
[12:32:17] please drop anything you are doing, we have a potential switchover blocker
[12:33:05] https://phabricator.wikimedia.org/T375186#10171002
[12:33:30] just got back in front of my keyboard
[12:33:31] checking
[12:33:56] * Emperor around (but I think not likely to be of assistance)
[12:37:53] I'm looking for the related change in phabricator but the search timed out :o
[12:38:51] * volans same
[12:39:08] * volans same as emp.eror, to be clear ;)
[12:42:09] jynus: this doesn't seem to come from a schema change going wrong, am I right? it's more a data consistency issue as I understand it → I don't see any recent action on change_tag: https://phabricator.wikimedia.org/search/query/R0C0gT_jNo0o/
[12:44:01] db[1157,1166,1175,1189,1198,1212,1223].eqiad.wmnet have it
[12:44:05] the rest don't
[12:45:40] there is a split brain there
[12:46:32] my suggestion would be to back up the row, REPLACE it and then DELETE it, both with replication enabled
[12:47:04] then check if it is a one-time thing or something more generalized
[12:47:10] thoughts?
[12:47:12] I must admit I'm out of my depth here, your plan sounds sane to me
[12:47:22] I'll track it in the task
[12:47:25] Amir1 or riccardo?
[12:47:38] I'm about to go to a doctor
[12:47:43] no worries
[12:48:01] jynus: how many hosts have it and how many don't? (to get the ratio)
[12:48:01] I just want someone else to sanity-check it
[12:48:05] Can we check when the split brain was caused?
[12:48:23] unless it happened in the last month, not easily
[12:48:24] Amir1: I'll add it to T375507 scope
[12:48:24] T375507: Consitency issue on change_tag - https://phabricator.wikimedia.org/T375507
[12:48:38] so we take note of it and try
[12:48:44] I don't think this is the first priority
[12:48:44] if it's old, I think it can stay. I have seen issues with change_tag and abuse filter having different values
[12:49:04] jynus: we can try to see when the first edit that had that tag was made?
[12:49:08] so, in that case it would be a similar action: replace but not delete
[12:49:31] let me see
[12:50:28] the revision is from 2018
[12:50:40] let it be
[12:50:45] noted in the task
[12:50:50] I have seen issues like this before
[12:50:51] (the edit, not the tag)
[12:51:11] the tags are usually added at the time of the edit, in some cases a day or two later
[12:51:16] fair, then I will run REPLACE with replication enabled, reinserting the row everywhere
[12:51:19] regardless, they are not even critical information
[12:51:29] yeah, I am not worried about the data
[12:51:34] but about replication breaking
[12:51:43] yeah
[12:52:19] I really have to go, I'm not feeling well in general, I'll try to help later
[12:53:01] take care Amir1
[12:54:03] arnaudb or volans, a sanity check? https://phabricator.wikimedia.org/T375186#10171265
[12:54:10] checking
[12:54:33] * volans same
[12:54:48] will move to the dedicated task
[12:54:53] lgtm
[12:55:05] jynus: isn't it missing a WHERE?
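A note on the WHERE question just above: REPLACE uses INSERT syntax, so the row to be replaced is identified by the primary (or unique) key supplied in the value list, not by a WHERE clause. A minimal sketch of what such a statement could look like, with placeholder values and assuming the current change_tag layout where ct_id is the primary key (the actual statement under review is in the linked Phabricator comment, not reproduced here):

  -- Run once on the section primary with statement-based replication enabled,
  -- so the same statement propagates to every replica that is missing the row.
  -- All values below are placeholders, not the real row from T375186.
  REPLACE INTO change_tag (ct_id, ct_rc_id, ct_log_id, ct_rev_id, ct_params, ct_tag_id)
  VALUES (123456789, NULL, NULL, 987654321, NULL, 42);

On hosts that already have the row, the REPLACE deletes and reinserts it; on hosts that are missing it, it behaves as a plain insert, so the fleet converges on the same contents either way.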
[12:55:37] ha, let me double-check the replace syntax
[12:55:37] or the replace gets it from ct_id in the values
[12:55:44] and so it's not needed
[12:55:50] sorry, I'm rusty
[12:56:22] my query interpreter accepts it no problem
[12:56:28] yeah, it is the same as insert (replace is an upsert)
[12:56:30] not sure how it'll actually be interpreted
[12:56:32] into is optional
[12:56:38] not bad
[12:56:55] LGTM
[12:57:29] is there any host with row-based replication that might fail the replace?
[12:57:55] this will be run on hosts with replication disabled, right?
[12:58:01] not on s3
[12:58:06] this is run in statement-based mode
[12:58:36] I mean, wikireplicas use row-based, but that should not be an issue, unless sanitarium and the wikireplicas have different data
[12:58:48] k
[12:58:49] which would be its own separate issue
[12:59:02] and should be easy to fix, but it is not the case based on my run
[13:00:10] ok, doing
[13:00:27] will monitor replication, feel free to ping me
[13:00:44] thx
[13:00:48] uh, 2 rows affected
[13:01:49] I guess it is ok, a delete + insert, or something is very wrong (ct_id is a pk)
[13:02:10] replication looks ok
[13:02:14] will run the table check again
[13:02:48] all 15 hosts now have the row
[13:02:57] 🎉
[13:03:30] maybe this is "normal", I am just not accustomed to it
[13:03:51] compare says the first 3000 rows look fine
[13:04:02] I know that there are some inconsistencies across our fleet; there is no silver bullet outside of wiping/mirroring every single host (which is doable but time-costly)
[13:04:18] let's resolve the ticket and I will keep doing the check
[13:04:42] I'll let you claim it jynus as you did all the work here
[13:04:55] just assign + resolve, thanks for filing it!
[13:05:00] <3
[13:05:03] great
[13:05:30] if change_tags are for the most part append-only, this is not an issue
[13:05:37] or not a common issue
[13:05:53] but we have already been bitten by a row difference once, last week
[13:05:57] will continue the check
[13:15:05] I am seeing other issues potentially, I will reopen the ticket but not make it UBN if it is just change_tag
[13:15:13] will make it private temporarily
[13:15:23] https://phabricator.wikimedia.org/T375507
[13:15:32] reopened
[15:48:29] FYI, folks, in case it's useful: the swift-rw.discovery.wmnet A/P service is switching to codfw right about now
[15:55:21] shouldn't make much difference to swift, since mw writes to both clusters most of the time anyway
[15:59:31] thanks, Emperor!
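For reference on the "2 rows affected" surprise at 13:00:48: that is the documented MySQL/MariaDB behaviour for REPLACE when a row with the same primary key already exists; the server deletes the old row and inserts the new one, and both operations count towards the affected-rows total. A self-contained sketch on a throwaway table (hypothetical, not the production schema) showing the two cases:

  -- Hypothetical demo table; any table with a primary key behaves the same way.
  CREATE TABLE replace_demo (id INT PRIMARY KEY, val VARCHAR(16));
  REPLACE INTO replace_demo VALUES (1, 'first');   -- no existing row: 1 row affected (plain insert)
  REPLACE INTO replace_demo VALUES (1, 'second');  -- id 1 already present: 2 rows affected (delete + insert)

So seeing 2 rows affected on a host that already had the change_tag row is consistent with the REPLACE doing exactly what was intended.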