[00:48:26] Blocked-on-schema-change, DBA, Wikimedia-Site-requests, Wikisource, and 2 others: Schema change for page content language - https://phabricator.wikimedia.org/T69223#3734908 (Krinkle)
[05:37:42] DBA, Operations, ops-eqiad: Degraded RAID on db1059 - https://phabricator.wikimedia.org/T179727#3735001 (Marostegui) a: Cmjohnson @cmjohnson, can we get the disk replaced? Thanks!
[06:34:02] Blocked-on-schema-change, DBA, Data-Services, Dumps-Generation, MediaWiki-Platform-Team (MWPT-Q2-Oct-Dec-2017): Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3735013 (Marostegui) db2058 done: ``` root@neodymium:/home/marostegui/T174569# ./check.sh db...
[12:12:28] Blocked-on-schema-change, DBA, Data-Services, Dumps-Generation, MediaWiki-Platform-Team (MWPT-Q2-Oct-Dec-2017): Schema change for refactored comment storage - https://phabricator.wikimedia.org/T174569#3735126 (Marostegui) dbstore2002 done: ``` **new tables** comment image_comment_temp revisio...
[15:52:02] Hi Amir1: I cannot read revision 170655496 of dewiki via SQL - what's going on? The revision id is correct and the page has content.
[15:52:53] I cannot read any newer revisions from the db either
[15:53:07] doctaxon: where are you running the query?
[15:53:17] dewiki
[15:53:32] I mean on labs or on tin/terbium?
[15:53:42] labs
[15:54:20] okay, let me check
[15:54:30] what is tin/terbium?
[15:54:55] select page_title from page join revision on rev_page = page_id where rev_id = 170655496;
[15:56:26] doctaxon: I checked and it seems the replicas in labs are lagged
[15:56:34] the main dbs are okay
[15:56:38] https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=5&fullscreen&orgId=1&from=now-3h&to=now
[15:57:08] please unlag the replicas
[15:58:13] doctaxon: https://tools.wmflabs.org/replag/
[15:58:25] It has half an hour of lag
[15:58:46] It's not easy to unlag them and the only way for now is for people to write less to the database
[15:59:27] in the case of dewiki (s5), it would be great if dewiki and/or wikidatawiki edited less (for now)
[15:59:42] DBAs are moving dewiki out of s5, it'll help but that's long term
[16:03:09] Amir1: are you shoveling dewiki from s5 to s3 right now?
[16:03:38] what?
[16:04:03] are you moving dewiki from s5 to s3 right now?
[16:05:47] I don't think dewiki is going anywhere
[16:06:31] not now, this is weeks of work by experts
[16:06:32] not me
[16:07:09] Reedy: There is a task for that, that's the plan (I think in Q3)
[16:07:25] probably a dedicated shard or s7
[16:07:34] Why would you move dewiki from s5?
[16:07:41] I know wikidata is going
[16:07:41] because of wikidatawiki
[16:08:24] maybe I misread, but wikidata is staying or dewiki gets the shard - which isn't really different
[16:08:48] dewiki isn't gonna be going to s3... that's for sure
[16:09:09] definitely, s3 is already exploding, thanks to cebwiki
[16:09:39] for dewiki, I think s7 or a dedicated shard
[16:10:26] well, it was dedicated until it was decided putting wikidata there was a good idea :P
[16:11:24] https://phabricator.wikimedia.org/T177208
[16:11:29] That suggests wikidata is moving
[16:11:31] that was a ... brilliant... idea
[16:11:43] It made sense years ago :)
[16:11:52] No one expected the growth that it got
[16:12:00] And at the time... It wasn't gonna get dedicated db hardware
[16:12:07] Not with the budget and DBA capacity
[16:12:23] Thanks, I thought it was the other way around
[16:12:34] Unless I'm mistaken..
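A minimal sketch of the check discussed at 15:54-15:58 above: run the same rev_id lookup against the dewiki Wiki Replica and see how far behind the replica is. pymysql, Toolforge-style credentials in ~/replica.my.cnf, the replica host name, and the heartbeat_p.heartbeat lag view are all assumptions, not something taken from the log.

```
#!/usr/bin/env python3
"""Is revision 170655496 visible on the dewiki replica yet, and how lagged
is that replica?  A sketch only; host name, credentials path and the
heartbeat_p view are assumptions about the environment."""
import os
import pymysql

REPLICA_HOST = "dewiki.analytics.db.svc.wikimedia.cloud"  # placeholder host
REV_ID = 170655496  # the revision doctaxon could not find

conn = pymysql.connect(
    host=REPLICA_HOST,
    database="dewiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
)
with conn.cursor() as cur:
    # Same query as in the conversation: map a revision id to its page title.
    cur.execute(
        "SELECT page_title FROM page "
        "JOIN revision ON rev_page = page_id WHERE rev_id = %s",
        (REV_ID,),
    )
    row = cur.fetchone()
    print("revision found:", row[0] if row else "not replicated yet")

    # Replication lag for the shard hosting dewiki (s5 at the time),
    # assuming the heartbeat_p.heartbeat view is available on this host.
    cur.execute("SELECT lag FROM heartbeat_p.heartbeat WHERE shard = 's5'")
    lag = cur.fetchone()
    print("s5 replica lag (seconds):", lag[0] if lag else "unknown")
conn.close()
```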
It was on s3 for a bit
[16:17:27] yes, wikidatawiki lagged dewiki for many months
[16:18:18] and bot scripts like quickstatements made it lag more and more
[17:39:05] dewiki is staying in s5 and wikidata will go to the future s8
[17:40:35] once s5 only hosts dewiki, we will see how we move things around to get better resource usage out of all the s5 servers
[17:57:24] DBA, Data-Services, Epic: Database replica drift on web and analytics clusters - https://phabricator.wikimedia.org/T179767#3735419 (Dispenser)
[18:28:14] DBA, Data-Services, Epic: Database replica drift on web and analytics clusters - https://phabricator.wikimedia.org/T179767#3735453 (Marostegui) I have checked and it is a valid drift (it is present on sanitarium and all the labs hosts). I guess this comes from the marathon of fixing the image table that...
[18:29:27] DBA, Data-Services, Epic: Database replica drift on web and analytics clusters - https://phabricator.wikimedia.org/T179767#3735460 (MZMcBride) From : ``` 13:41, 4 August 2015 1Veertje (talk | cont...
[18:32:01] DBA, Data-Services, Epic: Database replica drift on web and analytics clusters - https://phabricator.wikimedia.org/T179767#3735461 (MZMcBride) ``` MariaDB [commonswiki_p]> select @@hostname; +------------+ | @@hostname | +------------+ | labsdb1003 | +------------+ 1 row in set (0.00 sec) MariaDB [c...
[18:32:27] DBA, Data-Services, Epic: Database replica drift on web and analytics clusters - https://phabricator.wikimedia.org/T179767#3735462 (Marostegui) >>! In T179767#3735460, @MZMcBride wrote: > From : >...
[18:32:55] marostegui: It seems a bit weird that an image deleted/moved in 2015 would be in the new replicas.
[18:33:35] I assume labsdb1011 is new/using row-based replication, anyway.
[18:34:14] DBA, Data-Services, Epic: Database replica drift on web and analytics clusters - https://phabricator.wikimedia.org/T179767#3735463 (Marostegui) >>! In T179767#3735461, @MZMcBride wrote: > ``` > MariaDB [commonswiki_p]> select @@hostname; > +------------+ > | @@hostname | > +------------+ > | labsdb10...
[18:34:35] Ah.
[18:35:36] Replica drift drives me nuts. :-(
[18:36:32] Esther: why is it weird that it is on the new replicas, if it was copied from production?
[18:37:12] marostegui: I didn't realize production had been so busted, I guess.
[18:37:25] yeah, there were lots of inconsistencies there
[18:37:29] Usually images deleted in 2015 wouldn't be in 2017 replicas. :-)
[18:37:46] yeah, but those replicas were built from production servers before we got to fix them
[18:37:47] I'm glad it's getting reconciled.
[18:37:51] Sure.
[18:38:07] Has there been any work done toward automated sync checks between production and replicas?
[18:38:15] To detect drift.
[18:38:40] These types of bugs are frustrating as a volunteer developer. They always suck up time and energy to track down.
[18:38:54] Nope, not yet. Row-based replication should be good for recent data
[18:39:15] It is not easy to automate that sync :-(
[18:39:43] Sure, but I'm more interested in reporting than fixing.
[18:39:58] Like just knowing that there's drift between commonswiki.image and commonswiki_p.image would be nice.
[18:40:16] Yeah, that is what I meant, that it is not easy to report it
[18:40:18] Instead of some tool breaking and someone going to investigate and needing to find someone to query production.
[18:40:31] Bothering poor Reedy.
[18:40:55] Who doesn't like bothering Reedy?
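A minimal sketch of the "just knowing that there's drift" idea from 18:38-18:40 above: check whether a single suspect image row exists on two hosts you can reach. The host names, credentials path, and file name below are placeholders, and pymysql is assumed; this is not the DBAs' actual tooling.

```
#!/usr/bin/env python3
"""Report whether one image row is present on both a production-side host
and a wiki-replica host.  Sketch only; hosts and credentials are
placeholders."""
import os
import pymysql

CREDENTIALS = os.path.expanduser("~/.my.cnf")  # placeholder credentials file
HOSTS = {
    "production-side": ("db1102.eqiad.wmnet", "commonswiki"),      # placeholder FQDN
    "wiki-replica":    ("labsdb1011.eqiad.wmnet", "commonswiki_p"),  # placeholder FQDN
}
IMG_NAME = b"Example.jpg"  # hypothetical file name you suspect has drifted


def row_exists(host, database, img_name):
    """Return True if the image row is present on the given host."""
    conn = pymysql.connect(host=host, database=database,
                           read_default_file=CREDENTIALS)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1 FROM image WHERE img_name = %s", (img_name,))
            return cur.fetchone() is not None
    finally:
        conn.close()


for label, (host, database) in HOSTS.items():
    print(label, row_exists(host, database, IMG_NAME))
```

If the two prints disagree, that row has drifted, which is exactly the kind of lightweight report asked for above.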
[18:40:59] Even having a script that selected 100 random rows and compared them between the two would be nice.
[18:41:00] I certainly do!
[18:41:48] There are some tables where we would not be surprised if they have some drift, and the image table is one of them
[18:41:51] :-(
[18:42:00] As it had looots of drift within production itself
[18:42:07] image should be fairly stable and consistent, shouldn't it?
[18:42:09] Hmmm.
[18:42:17] It shouldn't be as hot as recentchanges or revision or similar.
[18:42:23] But I guess images are a mess.
[18:42:31] Apparently there were some issues some years ago with it (I don't know all the details)
[18:42:40] Bugs? In MediaWiki? :o
[18:42:48] No, I think more server crashes
[18:42:50] and stuff like that
[18:42:53] Ah, that too.
[18:43:11] And then you copy data from one server to another, then another server to another one, and you end up with some messy stuff
[18:43:15] Are images still in Swift?
[18:43:39] Jaime spent looots of hours fixing the image table and it is now a lot better in production (i.e. this row only exists on labs, not on production, on any server)
[18:43:50] So I think it is just a leftover from that clean up
[18:44:08] Going to grab some dinner now!
[18:44:12] Good eats.
[18:44:12] Thanks for reporting it
[18:44:21] I will try to get it fixed during the week
[18:44:28] Cool, thanks. :-)
[18:59:39] DBA, Data-Services: Database replica drift on web and analytics clusters - https://phabricator.wikimedia.org/T179767#3735486 (Marostegui)
[19:09:18] DBA, Data-Services: Database replica drift on web and analytics clusters - https://phabricator.wikimedia.org/T179767#3735491 (Marostegui) Same thing happens with the other zombie file: ``` SELECT * FROM image WHERE img_name = "Is-sur-Tille_Motocross_finale_du_championnat_de_France_féminin_2015_-_Erell_B...
[19:10:09] DBA, Data-Services: Database replica drift on web and analytics clusters - https://phabricator.wikimedia.org/T179767#3735493 (Marostegui)
[19:10:22] DBA, Data-Services: Database replica drift on web and analytics clusters - https://phabricator.wikimedia.org/T179767#3735419 (Marostegui)
[19:15:33] DBA, Data-Services: Database replica drift on web and analytics clusters - https://phabricator.wikimedia.org/T179767#3735497 (Marostegui) I have backed up those two rows, just in case, on the sanitarium host: ``` root@db1102:~/T179767# pwd /root/T179767 ```
[19:46:52] marostegui: is something like pt-table-checksum something that we can run regularly?
[19:56:14] legoktm: that is what we used to do, but it takes quite a long time and doesn't work for tables without a PK
[19:56:32] We are now taking a different approach, using mydumper+diff
[19:56:43] it is also less risky, as it is a SELECT and not an INSERT
[19:57:09] With pt-table-checksum we have to put some filters in place, control lag, make sure big tables don't get stuck, etc.
[19:57:15] It was the first approach we took
[19:57:19] But it was really painful
[19:57:36] Also we had lots of tables without a PK, and it took around a year to fix those...
[19:59:33] ok :/
[19:59:44] so is it not something that can be automated then?
[20:04:04] it could be, but it is a complex thing
[20:04:25] Especially because replication keeps flowing
[20:05:03] It's not like you can just stop replication, and if you don't, you can also get data drifts just because of in-flight replication
[20:05:10] Which are false positives :-)
[20:05:18] (or could hide real issues)
[22:06:33] Are images still in Swift?
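A read-only sketch in the spirit of the mydumper+diff approach described at 19:56-19:57 above (not the DBAs' actual tooling): pull the same primary-key chunks of the image table from two hosts, hash them, and compare, so nothing is ever written to either server. Host names, credentials path, chunk size, and pymysql are all assumptions.

```
#!/usr/bin/env python3
"""Compare primary-key chunks of the image table between two hosts using
only SELECTs.  Sketch only; hosts and credentials are placeholders."""
import hashlib
import os
import pymysql

CREDENTIALS = os.path.expanduser("~/.my.cnf")  # placeholder credentials file
HOST_A = "db1102.eqiad.wmnet"      # placeholder: production-side host
HOST_B = "labsdb1011.eqiad.wmnet"  # placeholder: wiki-replica host
DATABASE = "commonswiki"
CHUNK_SIZE = 1000


def connect(host):
    return pymysql.connect(host=host, database=DATABASE,
                           read_default_file=CREDENTIALS)


def chunk_hash(conn, lower, upper):
    """Hash all image rows with lower <= img_name < upper (upper may be None)."""
    sql = ("SELECT * FROM image WHERE img_name >= %s "
           + ("AND img_name < %s " if upper is not None else "")
           + "ORDER BY img_name")
    args = (lower,) if upper is None else (lower, upper)
    digest = hashlib.sha256()
    with conn.cursor() as cur:
        cur.execute(sql, args)
        for row in cur:
            digest.update(repr(row).encode())
    return digest.hexdigest()


conn_a, conn_b = connect(HOST_A), connect(HOST_B)
with conn_a.cursor() as cur:
    # Chunk boundaries: every CHUNK_SIZE-th primary-key value on host A.
    # Fine for a modest table; a real run would page through the keys, and
    # rows on host B sorting before host A's first key would be missed here.
    cur.execute("SELECT img_name FROM image ORDER BY img_name")
    keys = [r[0] for r in cur]
boundaries = keys[::CHUNK_SIZE] + [None]

for lower, upper in zip(boundaries, boundaries[1:]):
    if chunk_hash(conn_a, lower, upper) != chunk_hash(conn_b, lower, upper):
        print("possible drift in chunk starting at", lower)
conn_a.close()
conn_b.close()
```

As noted at 20:04-20:05, replication keeps flowing while such a check runs, so a mismatch in a recently written chunk can be a false positive (or can hide a real issue); recheck those chunks before calling them drift.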
[22:06:36] AFAIK yes
[22:27:31] DBA, Wikidata: Migrate wb_items_per_site to using prefixed entity IDs instead of numeric IDs - https://phabricator.wikimedia.org/T114904#3735651 (Multichill) >>! In T114904#3732816, @daniel wrote: > @Multichill wb_items_per_site is *always* items. So to get the full item ID, just use concat('Q', ips_ite...
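A small sketch of the concat('Q', ...) trick quoted above, assuming the standard wb_items_per_site columns (ips_item_id, ips_site_id, ips_site_page), a placeholder replica host, and Toolforge-style credentials; none of these specifics come from the log itself.

```
#!/usr/bin/env python3
"""Build full item IDs (Q-prefixed) from wb_items_per_site in the query.
Sketch only; host name and credentials path are placeholders."""
import os
import pymysql

conn = pymysql.connect(
    host="wikidatawiki.analytics.db.svc.wikimedia.cloud",  # placeholder host
    database="wikidatawiki_p",
    read_default_file=os.path.expanduser("~/replica.my.cnf"),
)
with conn.cursor() as cur:
    # CONCAT('Q', ips_item_id) turns the numeric ID into a full entity ID.
    cur.execute(
        "SELECT CONCAT('Q', ips_item_id), ips_site_id, ips_site_page "
        "FROM wb_items_per_site WHERE ips_site_id = %s LIMIT 5",
        ("dewiki",),
    )
    for entity_id, site, title in cur.fetchall():
        print(entity_id, site, title)
conn.close()
```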