[05:49:56] DBA, Lexicographical data, Wikidata, Wikidata-Campsite, and 6 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Marostegui) These tables are fine: ``` logging revision slots text ```...
[05:50:53] DBA, Lexicographical data, Wikidata, Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Marostegui)
[05:59:00] DBA, Lexicographical data, Wikidata, Datacenter-Switchover-2018, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Addshore)
[06:07:20] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata, wikidata-tech-focus: wikibase: synchronize schema on production with what is created on install - https://phabricator.wikimedia.org/T85414 (Marostegui)
[06:07:23] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata, wikidata-tech-focus: compress wb_changes_dispatch on testwikidatawiki - https://phabricator.wikimedia.org/T207359 (Marostegui) Open>Resolved a:Marostegui This has now been done: ``` root@neodymium:/home/marostegui# ./section s3 |...
[07:28:39] did you stop db1092 and db1087 replication?
[07:28:50] yes
[07:28:56] when?
[07:29:05] ˜/marostegui 8:39> !log Stop replication on db1092 and db1087 for checking T206743
[07:29:06] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743
[07:29:41] is that blocking for some reason?
[07:29:47] *blocking you
[07:29:58] well, I was checking db1071 with db1092
[07:30:03] ah shit
[07:30:06] I didn't know
[07:30:07] and just deleted some rows
[07:30:07] let me start
[07:30:16] because they were not there, as you said
[07:30:42] you were supposed to check, but I was supposed to only modify things
[07:30:49] unless you tell me so
[07:30:53] yeah yeah
[07:30:59] Let me start it again
[07:31:30] done
[07:31:37] should catch up quite quickly
[07:32:05] I don't mind if you want to stop them
[07:32:16] but I need to know to not use them for fixes
[07:32:38] I stopped them to make sure change_tag was not reporting false positives due to ongoing writes
[07:32:42] but the idea is never to stop them again
[07:32:49] no reason to keep them stopped again
[07:37:18] So if I do INSERT INTO $TABLE (ct_id) VALUES (235206000)
[07:37:27] that doesn't sync the autoincs
[07:37:37] so I may have to do an alter table?
[07:37:59] or as you said, get a gap
[07:38:22] the insert doesn't work
[07:40:10] db1087 and db1092 are now in sync with the master btw
[07:41:45] should I run ALTER TABLE change_tag AUTO_INCREMENT=236000000; with binlog on the master?
[07:42:00] are they?
[07:42:09] or you mean replication only
[07:44:49] should I do the alter?
[07:45:23] My only thought is…should we maybe finish everything before altering it?
[07:45:38] I cannot finish it without syncing the autoinc
[07:45:57] because new ids will be out of sync
[07:46:02] right
[07:46:06] true
[07:46:14] as soon as I fix them, the new ones are different
[07:46:48] I am open for suggestions
[07:47:14] that is why I was sure there were no differences
[07:47:19] but now there are some
[07:48:00] wait, they synced
[07:48:02] finally
[07:48:13] it was the lag confusing me
[07:49:04] yeah
[07:49:07] I posted it above :)
[07:49:46] I need to leave for 30 minutes, sorry, kinda an emergency
[08:36:41] I am back
[08:40:30] you can stop the servers if you want, knowing you are going to touch them I will not
[08:50:03] no, so far I don't need to
[08:50:15] I will ask you before stopping, but so far nothing planned
[08:50:25] it is now checking pagelinks
[09:38:59] marostegui: I need to stop db1087 to properly fix db1124
[09:39:41] jynus: sure, give me 2:02 minutes
[09:39:44] ;)
[09:39:47] ok
[09:40:27] you don't need to wait- all checks happen in the same transaction
[09:40:58] (I think)
[09:41:16] but if I am comparing it to a host that has replication stopped and the other doesn't, the table will report differences
[09:41:19] or it is likely to
[09:41:29] if you start it stopped yes
[09:41:40] but if you stop mid-check not
[09:41:43] (you can now stop it)
[09:41:50] did it finish?
[09:41:53] yep
[09:41:57] ok
[10:12:56] please don't send an email at a time, record the changes somewhere and then I can have a look
[10:14:09] * marostegui nods
[10:14:44] I said last time that email was ok because I thought you were going to send 1 at the end of the day
[10:15:38] ok
[10:15:47] so the problem with db1124 is that change_tag was fixed there but not on its master, so by fixing the master, the replica broke
[10:17:31] I see
[10:17:39] Good that we caught it
[11:28:06] I've restarted replication
[11:28:21] but there may be more differences there now I have to fix still
[12:02:01] :(
[12:03:26] I did a check between db1092 and db1071 earlier and it reported no differences
[12:08:53] which tables?
[12:09:18] change_tag
[12:09:34] I did a check for the earlier differences reported
[12:09:38] (only for those ranges)
[12:11:16] yeah, I fixed those, but because replication, it broke db1124:3318
[12:11:30] let me know when you want me to do another full change_tag check
[12:11:38] I fixed change_tag
[12:11:44] but other tables broke
[12:11:47] Oh
[12:11:49] I get you
[12:51:07] I think I fixed everything that broke now
[12:51:42] you are the best
[12:51:52] I am now checking the last table, wb_terms
[12:52:06] you can stop or do anything
[12:52:12] ?
[12:52:14] I am not going to touch s8 more for now
[12:52:21] stop replication
[12:52:25] Ah
[12:52:28] No need to so far :)
[12:52:29] Thanks
[12:56:52] DBA, Lexicographical data, Wikidata, Datacenter-Switchover-2018, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (jcrespo) p:High>Normal We believe this to be fixed full...
[13:04:48] DBA, Lexicographical data, Wikidata, Datacenter-Switchover-2018, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Marostegui) More tables to check after the ones checked at: T20...
[13:05:14] DBA, Lexicographical data, Wikidata, Datacenter-Switchover-2018, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Marostegui)
[13:13:10] DBA, Wikimedia-Incident: Compare a few tables per section between hosts and DC - https://phabricator.wikimedia.org/T207253 (jcrespo) So this has to be done (I will check in case there is a duplicate task already), not arguing against that. My suggestion would be to make as an actionable, alternatively,...
[13:18:01] DBA, Wikimedia-Incident: Compare a few tables per section between hosts and DC - https://phabricator.wikimedia.org/T207253 (Marostegui) >>! In T207253#4688613, @jcrespo wrote: > So this has to be done (I will check in case there is a duplicate task already), not arguing against that. > > My suggestion w...
[13:20:11] DBA, Wikimedia-Incident: Compare a few tables per section between hosts and DC - https://phabricator.wikimedia.org/T207253 (jcrespo) > that task is mostly append only I am guessing table- it is mostly updates- every time an edit is done, a counter is updated there. Revisions is probably the one with mos...
[13:20:49] DBA, Wikimedia-Incident: Compare a few tables per section between hosts and DC - https://phabricator.wikimedia.org/T207253 (Marostegui) >>! In T207253#4688624, @jcrespo wrote: >> that task is mostly append only > > I am guessing table- it is mostly updates- every time an edit is done, a counter is updat...
[14:26:14] lot of load on 2 enwiki instances
[14:27:05] enwiki in general, with lots of writes
[14:27:22] https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s1&var-role=All
[14:27:53] although this is the largest read spike
[14:29:05] on sal around those hours
[14:29:08] • 11:20 zfilipin@deploy1001: Synchronized wmf-config/throttle.php: SWAT: https://gerrit.wikimedia.org/r/469168 (duration: 00m 46s)
[14:29:11] • 11:11 zfilipin@deploy1001: Synchronized wmf-config/InitialiseSettings.php: SWAT: https://gerrit.wikimedia.org/r/469180 (duration: 00m 47s)
[14:56:37] I am going to remove the MOTD from cumin1001
[14:57:49] are we sure?
[14:58:02] should we restart the server?
[14:58:18] yeah, wouldn't be a bad idea after all the OOM
[14:58:46] I will talk to volans tomorrow
[14:59:02] I removed the MOTD because we are not doing any big big queries from it for now
[15:00:24] yeah, but I didn't know if we will in the future
[15:00:50] Yeah, the MOTD stated: Some heavy queries are being run from this host as part of the recovery for T206743
[15:00:51] T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743
[15:00:54] Which we are not doing for now
[15:01:02] Anyways, if you feel it should still be up, feel free to add it
[15:01:10] there is not much point in spending time on it
[15:02:31] I am out, see you tomorrow. bye
[15:02:47] bye
[15:07:38] re:cumin1001, sure a reboot wouldn't hurt, it can be done anytime if no one is using it, and just re-arm keyholder after the reboot
[15:07:51] I can take care of it, just let me know when you're done :)
[15:10:42] that is what I was telling manuel- we are not actively using it, but we are checking what was done
[15:11:01] and if some check fail, we may have to start doing fixes again
[15:12:17] on the other side, we don't want to take over resources we are not really using
[15:14:21] DBA, Research: Beta labs: Request to create database and account for recommendation API - https://phabricator.wikimedia.org/T207756 (bmansurov)
[15:17:46] jynus: no problem for me to wait some additional days in case you might need to do additional checks
[15:18:02] we still have other 3 hosts that share the same role :)
[15:18:25] as we can postpone a bit the decom of sarin/neodymium
[21:11:55] DBA, Research: Beta labs: Request to create database and account for recommendation API - https://phabricator.wikimedia.org/T207756 (bmansurov)