[02:02:47] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "2062 Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3671004 (10Catrope) Thanks @jcrespo , @Reedy and @Ladsgr...
[05:30:31] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3671101 (10Marostegui)
[06:30:38] 10DBA, 10Analytics: Drop MoodBar tables from all wikis - https://phabricator.wikimedia.org/T153033#3671151 (10Marostegui)
[06:55:50] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3671159 (10Marostegui)
[07:20:01] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3671181 (10jcrespo) Sure, it is ok.
[07:22:03] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "2062 Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3671182 (10jcrespo) @Catrope- discussion ongoing, feel f...
[08:05:12] did you see my comment about the enwiki checksums?
[08:05:33] yep :(
[08:06:00] and i was thinking that for a general archive (only, at least) table check, pt-table-checksum might be faster than mydumper
[08:06:04] to compare all hosts at once
[08:07:11] yes, that could work, at least for non-labsdb hosts
[08:07:23] archive is probably not that large a table
[08:07:42] around 20G
[08:07:52] So we could just check it on the core hosts, just that table
[08:07:56] not tiny
[08:08:02] but at least it is not revision
[08:08:06] to get an idea of what those look like
[08:13:35] 10DBA, 10Patch-For-Review: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807#3671205 (10Marostegui) Let's run a pt-table-checksum for the following hosts (only core): ``` db2016.codfw.wmnet db2034.codfw.wmnet db2042.codfw.wmnet db2055.codfw.wmnet db2062.codfw.wmnet db2069.codfw...
[08:19:37] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "2062 Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3671212 (10jcrespo) I forgot to say we suspect the same...
[08:27:03] so this is the thing: with lag control, I think it will take 6 hours or more to purge the 6 million records
[08:27:17] not too bad
[08:27:35] where are you purging? dbstore1002 for now?
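For reference, a pt-table-checksum run scoped to just the archive table, as discussed above, could be assembled roughly like this. This is a sketch only: the master host placeholder, the `--replicate` result table, and the lag threshold are assumptions, not the actual production invocation.

```python
# Sketch: build a pt-table-checksum command line limited to one table
# (enwiki.archive). Host and --replicate table are placeholders.
def checksum_command(master_host, db="enwiki", table="archive"):
    return [
        "pt-table-checksum",
        "--databases", db,
        "--tables", table,                    # only check this one table
        "--replicate", "percona.checksums",   # assumed checksum result table
        "--max-lag", "5",                     # pause if replicas lag > 5s
        "h=" + master_host,
    ]

cmd = checksum_command("MASTER_HOST_PLACEHOLDER")
```

Running it from the master lets the tool compare all replicas at once via the replicated checksum table, which is the "compare all hosts at once" point made above.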
[08:27:43] after testing on dbstore1002, I am quite confident about running it on production
[08:28:00] commonswiki_test_T177772.recentchanges
[08:28:03] on dbstore1002
[08:28:34] so I am going to run it from db1068, because it has archive-to-file functionality
[08:28:47] so if we happen to delete something we didn't intend, it is local to the master
[08:28:58] aside from that, I will create a backup now from the master
[08:28:59] keep in mind that tomorrow we are stopping mysql on db1068
[08:29:01] for the mysql upgrade
[08:29:04] oh
[08:29:13] well, it can be stopped at any time
[08:29:21] sure, just letting you know :-)
[08:29:23] so not a blocker
[08:29:37] I will check lag with db1053
[08:29:49] I will leave a screen on db1068
[08:30:09] and it can just be ctrl-c'd at any time, it does 1 commit per row
[08:30:16] so no problem there
[08:30:21] ok with the plan?
[08:30:43] yep
[08:30:45] yeah
[08:30:46] fine by me
[08:30:53] I will add it to the procedure for tomorrow
[08:30:57] I will stop it before the night
[08:30:59] to stop it
[08:31:00] ah
[08:31:01] ok :)
[08:31:17] so the buffer pool can be emptied quickly before the stop
[08:31:49] yeah :)
[08:31:54] we should add the: dump the buffer pool, set max dirty to 10, stop without dumping the buffer pool
[08:32:21] ahead of the stop
[08:32:29] we can talk later on details and prep
[08:32:33] yeah
[08:32:47] I would normally wait for this until Thursday
[08:32:59] but I honestly think we cannot wait that long
[08:33:15] yeah, no problem really
[08:33:21] let's start it today
[08:33:28] also Thursday is a public holiday so it is a "lost" day
[08:49:14] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3671256 (10Marostegui)
[09:02:45] I am running the backup on db1068 now
[09:03:42] \o/
[09:56:10] hoo, Amir1 anyone around?
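The pre-shutdown steps sketched above (dump the buffer pool now, lower the dirty-page percentage, then stop without re-dumping) correspond to standard InnoDB/MariaDB global variables. A minimal sketch, assuming any DB-API style cursor; the exact procedure used may have differed:

```python
# Sketch of the pre-shutdown prep discussed above, using standard
# InnoDB global variables. `cursor` is any DB-API style cursor.
PRE_SHUTDOWN_STATEMENTS = [
    # 1. dump the buffer pool contents now, while the server is warm
    "SET GLOBAL innodb_buffer_pool_dump_now = ON",
    # 2. start flushing dirty pages aggressively ahead of the stop
    "SET GLOBAL innodb_max_dirty_pages_pct = 10",
    # 3. skip the dump at shutdown itself (we already have a fresh one)
    "SET GLOBAL innodb_buffer_pool_dump_at_shutdown = OFF",
]

def prepare_shutdown(cursor):
    for stmt in PRE_SHUTDOWN_STATEMENTS:
        cursor.execute(stmt)
```

Doing the dump ahead of time means the eventual `mysqld` stop does not have to wait on either the dump or a large dirty-page flush, which is why it makes the stop quick.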
[09:57:08] I am going to start purging commonswiki, doing it slowly to avoid issues, based on key "rc_source = 'wb'"
[09:57:58] I think hoo said that was correct, but I would want to double-confirm those are the right filtering parameters
[09:58:25] at least, according to the numbers it seems about right, also based on not having any of those recently
[10:34:04] jynus: I'm around now if you need help
[10:34:31] can you see my question^
[10:34:48] confirm that "rc_source = 'wb'" is right?
[10:56:05] jynus: yes that's right
[10:56:22] please ping me, I get distracted easily in the office :D
[11:16:50] thanks
[11:19:56] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3671634 (10Marostegui)
[11:22:18] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3671637 (10Marostegui) p:05Triage>03Normal
[12:41:48] 10DBA, 10monitoring, 10Epic, 10Wikimedia-Incident: Reduce false positives on database pages - https://phabricator.wikimedia.org/T177782#3671818 (10mark) > Single server issues should not page- Mediawiki should be reliable enough so that if a single server starts lagging or disapperas, it should use the res...
[13:17:26] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "2062 Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3671914 (10Reedy) >>! In T171027#3671004, @Catrope wrote...
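The purge described above (delete recentchanges rows where `rc_source = 'wb'`, throttled on replication lag) can be sketched as a batched loop. The batch size, lag threshold, and function names here are made up for illustration; the actual run used a tool on db1068 that committed one row at a time and archived deleted rows to a file:

```python
import time

# Illustrative throttled purge: delete rc_source = 'wb' rows in small
# batches, pausing whenever a reference replica lags. `lag_seconds` is
# an assumed callable returning the current replication lag.
def purge_wikidata_rc(conn, lag_seconds, batch=1000, max_lag=5):
    total = 0
    while True:
        if lag_seconds() > max_lag:   # let the replica catch up first
            time.sleep(1)
            continue
        cur = conn.cursor()
        cur.execute(
            "DELETE FROM recentchanges WHERE rc_source = 'wb' LIMIT %s",
            (batch,),
        )
        conn.commit()
        total += cur.rowcount
        if cur.rowcount < batch:      # short batch: nothing left to purge
            return total
```

Committing per batch (or per row, as in the real run) keeps replication events small, which is what makes it safe to interrupt with ctrl-c at any point.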
[14:14:57] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3672104 (10Marostegui)
[14:39:49] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3672180 (10Marostegui)
[14:47:22] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2038 - https://phabricator.wikimedia.org/T177844#3672205 (10Marostegui)
[14:48:32] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672207 (10Marostegui) And one of them failed already: T177844 ``` physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS,...
[14:48:47] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2038 - https://phabricator.wikimedia.org/T177844#3672201 (10Marostegui) This is being handled at: T177720
[14:49:00] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2038 - https://phabricator.wikimedia.org/T177844#3672215 (10Marostegui)
[14:49:02] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3667836 (10Marostegui)
[14:49:45] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3667836 (10Marostegui) 05duplicate>03Open
[14:50:01] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3667836 (10Marostegui)
[14:50:03] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2038 - https://phabricator.wikimedia.org/T177844#3672201 (10Marostegui)
[14:52:45] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672225 (10Marostegui) @Papaul let us know if you were able to find disks to replace the (now) broken one and the one that will soon fail. Thanks!
[14:53:34] the old es server may have the same disk, if not, probably db2028,29
[14:54:09] db2010 does
[14:55:22] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672230 (10Marostegui) @Papaul db2010 which is scheduled for decommissioning (T175685) has the same chassis, so maybe it also has the same disks?
[14:55:47] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672232 (10Papaul) @Marostegui I have some 600GB 15k that I can pull out off db2025. Just keep in mind that those are Dell disks
[14:57:03] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672234 (10Marostegui) If db2025 is decommissioned, I would say let's go ahead...
[14:57:19] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3672236 (10Cmjohnson) The part is backordered, I will update ticket as soon I see it's shipped.
[14:57:51] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3672242 (10Marostegui) Thank you!
[15:00:02] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672250 (10Marostegui) Btw, let's change just one disk at the time.
[15:00:45] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672264 (10Papaul) ok I i will replaced first slot 1
[15:01:17] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672266 (10Marostegui) Sounds good - thank you
[15:04:35] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672273 (10Papaul) Complete. Let me know when ready for slot 7
[15:05:24] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672276 (10Marostegui) Thanks, RAID rebuilding now: ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 1% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)...
[15:08:11] BTW, s4 replication performance improved considerably in the last day
[15:08:24] see replication graphs for dbstore1001
[15:09:38] why would that be?
[15:10:20] less rc writes?
[15:10:44] ah yeah, indeed
[15:11:46] fucking lol
[15:11:57] that's kinda depressing
[15:12:14] well, it is not that visible
[15:12:26] but we had some nodes 1 week behind
[15:12:32] so it is a lot less iops
[15:12:40] for them
[15:14:35] see -ops recovery
[15:15:10] <3
[15:18:03] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3667836 (10Marostegui) @Papaul the rebuild for that disk has failed - can we try another spare disk maybe?
[15:18:56] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672357 (10Papaul) ok
[15:21:56] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672364 (10Papaul) done
[15:22:39] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672365 (10Marostegui) Here we go again: ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 0% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physica...
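The controller output pasted into the task above follows a regular format (lines like `logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 1% complete)`), so rebuild progress can be extracted mechanically. A small parsing sketch; the field names in the returned dict are my own:

```python
import re

# Parse HP controller logicaldrive status lines like those quoted in
# the task, e.g.:
#   logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 1% complete)
#   logicaldrive 1 (3.3 TB, RAID 1+0, OK)
LD_RE = re.compile(
    r"logicaldrive\s+(\d+)\s+\(([^,]+),\s*([^,]+),\s*(\w+)"
    r"(?:,\s*(\d+)% complete)?\)"
)

def parse_logicaldrive(line):
    m = LD_RE.search(line)
    if not m:
        return None
    drive, size, raid, status, pct = m.groups()
    return {
        "drive": int(drive),
        "size": size.strip(),
        "raid": raid.strip(),
        "status": status,                       # e.g. OK / Recovering
        "percent": int(pct) if pct else None,   # rebuild progress, if any
    }
```

A loop over the output of the controller CLI could then alert once the status flips back to OK, instead of re-pasting the raw output into the task by hand.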
[15:25:40] I've upgraded 10.1.28 for jessie
[15:25:49] *uploaded
[15:34:21] Reedy, it is early to say, but it seems there are no longer insert spikes of 100x the normal rate: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=2&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1068&var-port=9104&from=now-7d&to=now
[15:34:31] :)
[15:50:30] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3672446 (10Papaul)
[15:51:32] db2051 getting some lag
[15:51:59] I may have to use it as our slowest slave
[15:53:13] I think that works better, but the lagged purge will be even slower :-(
[15:55:09] probably too slow :-/
[15:55:35] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3672457 (10Papaul) a:05Papaul>03RobH
[15:56:00] I think I am going to disable checking cross-dc and not wait for replication on codfw
[15:58:07] I am just going to do that
[16:08:37] 10DBA, 10MediaWiki-Watchlist, 10Wikidata: Purge 90% of rows from recentchanges (and posibly defragment) from commonswiki and ruwiki (the ones with source:wikidata) - https://phabricator.wikimedia.org/T177772#3672475 (10jcrespo) I have allowed for codfw to lag- so that we can go at around 500 deletes/s. That...
[16:12:50] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3672483 (10hoo) I'll enable the tracking on the two wikis (`cawiki`, `cewiki`) tomorrow then. Btw, the estimate for k...
[16:29:57] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3672577 (10jcrespo) 51.5M you meant, maybe?
[16:34:21] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3672617 (10hoo) >>! In T151717#3672577, @jcrespo wrote: > 51.5M you meant, maybe? No for kowiki we're indeed talking...
[16:53:41] 1380000 rows deleted so far in less than 1 hour
[16:53:58] wow that is pretty good
[16:54:19] the deletes are easy, it is the defrag that will take more involvement
[16:58:09] disk space consumption seems ok
[17:11:06] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3672802 (10Framawiki)
[17:20:09] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672822 (10Marostegui) @Papaul the disk went fine, can you change the other one pending now? ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1,...
[17:28:07] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3667836 (10Marostegui) I can see it is rebuilding now - thanks! ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 0% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 6...
[17:38:12] I will stop the purge process later
[17:38:56] cool
[17:38:57] thanks
[18:33:31] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3673060 (10Lydia_Pintscher) This is a hugely political issue....
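As a sanity check on the numbers above: 1,380,000 rows in just under an hour is a rate of at least ~383 deletes/s, so at that pace the remaining rows (out of the ~6 million estimate) need roughly 3.3 more hours. The elapsed time below is taken as a full hour, so the computed rate is a lower bound:

```python
# Back-of-the-envelope check of the purge rate reported above.
rows_deleted = 1_380_000
elapsed_s = 3600                         # "less than 1 hour": upper bound
rate = rows_deleted / elapsed_s          # ~383 rows/s, a lower bound

target_rows = 6_000_000                  # total rows to purge, per the estimate
hours_remaining = (target_rows - rows_deleted) / rate / 3600   # ~3.3 h
```

This is consistent with the ~500 deletes/s figure from the task once codfw was allowed to lag, and with the later correction that the original "less than 3 hours" total was a slip.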
[20:30:51] 10DBA, 10Wikidata: Migrate wb_terms to using prefixed entity IDs instead of numeric IDs - https://phabricator.wikimedia.org/T114903#3673756 (10Ladsgroup)
[20:50:10] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3673815 (10Risker) >>! In T171027#3673060, @Lydia_Pintscher w...
[22:06:44] 10DBA, 10MediaWiki-Watchlist, 10Wikidata: Purge 90% of rows from recentchanges (and posibly defragment) from commonswiki and ruwiki (the ones with source:wikidata) - https://phabricator.wikimedia.org/T177772#3674015 (10jcrespo) > That means the whole thing will take less than 3 hours I had a mind slip... we...