[02:02:47] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "2062 Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3671004 (10Catrope) Thanks @jcrespo , @Reedy and @Ladsgr...
[05:30:31] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3671101 (10Marostegui)
[06:30:38] 10DBA, 10Analytics: Drop MoodBar tables from all wikis - https://phabricator.wikimedia.org/T153033#3671151 (10Marostegui)
[06:55:50] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3671159 (10Marostegui)
[07:20:01] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3671181 (10jcrespo) Sure, it is ok.
[07:22:03] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "2062 Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3671182 (10jcrespo) @Catrope- discussion ongoing, feel f...
[08:05:12] did you see my comment about the enwiki checksums?
[08:05:33] yep :(
[08:06:00] and i was thinking that for a general archive (only, at least) table check, pt-table-checksum might be faster than mydumper
[08:06:04] to compare all hosts at once
[08:07:11] yes, that could work, at least for non-labsdb hosts
[08:07:23] archive is probably not that large a table
[08:07:42] around 20G
[08:07:52] So we could just check it on the core hosts, just that table
[08:07:56] not tiny
[08:08:02] but at least it is not revision
[08:08:06] to get an idea of what those look like
[08:13:35] 10DBA, 10Patch-For-Review: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807#3671205 (10Marostegui) Let's run a pt-table-checksum for the following hosts (only core): ``` db2016.codfw.wmnet db2034.codfw.wmnet db2042.codfw.wmnet db2055.codfw.wmnet db2062.codfw.wmnet db2069.codfw...
[08:19:37] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "2062 Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3671212 (10jcrespo) I forgot to say we suspect the same...
[08:27:03] so this is the thing: with lag control, I think it will take 6 hours or more to purge the 6 million records
[08:27:17] not too bad
[08:27:35] where are you purging? dbstore1002 for now?
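For reference, a pt-table-checksum run scoped to just the archive table, as discussed above, could be assembled roughly like this. This is a sketch only: the master host placeholder, the `--replicate` result table, and the lag threshold are assumptions, not the actual production invocation.

```python
# Sketch: build a pt-table-checksum command line limited to one table
# (enwiki.archive). Host and --replicate table are placeholders.
def checksum_command(master_host, db="enwiki", table="archive"):
    return [
        "pt-table-checksum",
        "--databases", db,
        "--tables", table,                    # only check this one table
        "--replicate", "percona.checksums",   # assumed checksum result table
        "--max-lag", "5",                     # pause if replicas lag > 5s
        "h=" + master_host,
    ]

cmd = checksum_command("MASTER_HOST_PLACEHOLDER")
```

Running it from the master lets the tool compare all replicas at once via the replicated checksum table, which is the "compare all hosts at once" point made above.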
[08:27:43] after testing on dbstore1002, I am quite confident about running it on production
[08:28:00] commonswiki_test_T177772.recentchanges
[08:28:03] on dbstore1002
[08:28:34] so I am going to run it from db1068, because it has archive-to-file functionality
[08:28:47] so if we happen to delete something we didn't intend, it is local to the master
[08:28:58] aside from that, I will create a backup now from the master
[08:28:59] keep in mind that tomorrow we are stopping mysql on db1068
[08:29:01] for the mysql upgrade
[08:29:04] oh
[08:29:13] well, it can be stopped at any time
[08:29:21] sure, just letting you know :-)
[08:29:23] so not a blocker
[08:29:37] I will check lag with db1053
[08:29:49] I will leave a screen on db1068
[08:30:09] and it can just be ctrl-c'd at any time, it does 1 commit per row
[08:30:16] so no problem there
[08:30:21] ok with the plan?
[08:30:43] yep
[08:30:45] yeah
[08:30:46] fine by me
[08:30:53] I will add it to the procedure for tomorrow
[08:30:57] I will stop it before the night
[08:30:59] to stop it
[08:31:00] ah
[08:31:01] ok :)
[08:31:17] so the buffer pool can be emptied quickly before the stop
[08:31:49] yeah :)
[08:31:54] we should add the: dump the buffer pool, set max dirty to 10, stop without dumping the buffer pool
[08:32:21] ahead of the stop
[08:32:29] we can talk later on details and prep
[08:32:33] yeah
[08:32:47] I would normally wait for this until Thursday
[08:32:59] but I honestly think we cannot wait that long
[08:33:15] yeah, no problem really
[08:33:21] let's start it today
[08:33:28] also Thursday is a public holiday so it is a "lost" day
[08:49:14] 10DBA, 10Patch-For-Review: Productionize 22 new codfw database servers - https://phabricator.wikimedia.org/T170662#3671256 (10Marostegui)
[09:02:45] I am running the backup on db1068 now
[09:03:42] \o/
[09:56:10] hoo, Amir1 anyone around?
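The pre-shutdown steps sketched above (dump the buffer pool now, lower the dirty-page percentage, then stop without re-dumping) correspond to standard InnoDB/MariaDB global variables. A minimal sketch, assuming any DB-API style cursor; the exact procedure used may have differed:

```python
# Sketch of the pre-shutdown prep discussed above, using standard
# InnoDB global variables. `cursor` is any DB-API style cursor.
PRE_SHUTDOWN_STATEMENTS = [
    # 1. dump the buffer pool contents now, while the server is warm
    "SET GLOBAL innodb_buffer_pool_dump_now = ON",
    # 2. start flushing dirty pages aggressively ahead of the stop
    "SET GLOBAL innodb_max_dirty_pages_pct = 10",
    # 3. skip the dump at shutdown itself (we already have a fresh one)
    "SET GLOBAL innodb_buffer_pool_dump_at_shutdown = OFF",
]

def prepare_shutdown(cursor):
    for stmt in PRE_SHUTDOWN_STATEMENTS:
        cursor.execute(stmt)
```

Doing the dump ahead of time means the eventual `mysqld` stop does not have to wait on either the dump or a large dirty-page flush, which is why it makes the stop quick.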
[09:57:08] I am going to start purging commonswiki, doing it slowly to avoid issues, based on key "rc_source = 'wb'"
[09:57:58] I think hoo said that was correct, but I would want to double-confirm those are the right filtering parameters
[09:58:25] at least, according to the numbers it seems about right, also based on not having any of those recently
[10:34:04] jynus: I'm around now if you need help
[10:34:31] can you see my question^
[10:34:48] confirm that "rc_source = 'wb'" is right?
[10:56:05] jynus: yes that's right
[10:56:22] please ping me, I get distracted easily in the office :D
[11:16:50] thanks
[11:19:56] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3671634 (10Marostegui)
[11:22:18] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3671637 (10Marostegui) p:05Triage>03Normal
[12:41:48] 10DBA, 10monitoring, 10Epic, 10Wikimedia-Incident: Reduce false positives on database pages - https://phabricator.wikimedia.org/T177782#3671818 (10mark) > Single server issues should not page- Mediawiki should be reliable enough so that if a single server starts lagging or disapperas, it should use the res...
[13:17:26] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "2062 Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3671914 (10Reedy) >>! In T171027#3671004, @Catrope wrote...
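The purge described above (delete recentchanges rows where `rc_source = 'wb'`, throttled on replication lag) can be sketched as a batched loop. The batch size, lag threshold, and function names here are made up for illustration; the actual run used a tool on db1068 that committed one row at a time and archived deleted rows to a file:

```python
import time

# Illustrative throttled purge: delete rc_source = 'wb' rows in small
# batches, pausing whenever a reference replica lags. `lag_seconds` is
# an assumed callable returning the current replication lag.
def purge_wikidata_rc(conn, lag_seconds, batch=1000, max_lag=5):
    total = 0
    while True:
        if lag_seconds() > max_lag:   # let the replica catch up first
            time.sleep(1)
            continue
        cur = conn.cursor()
        cur.execute(
            "DELETE FROM recentchanges WHERE rc_source = 'wb' LIMIT %s",
            (batch,),
        )
        conn.commit()
        total += cur.rowcount
        if cur.rowcount < batch:      # short batch: nothing left to purge
            return total
```

Committing per batch (or per row, as in the real run) keeps replication events small, which is what makes it safe to interrupt with ctrl-c at any point.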
[14:14:57] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3672104 (10Marostegui)
[14:39:49] 10Blocked-on-schema-change, 10DBA, 10Patch-For-Review: Drop now redundant indexes from pagelinks and templatelinks - https://phabricator.wikimedia.org/T174509#3672180 (10Marostegui)
[14:47:22] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2038 - https://phabricator.wikimedia.org/T177844#3672205 (10Marostegui)
[14:48:32] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672207 (10Marostegui) And one of them failed already: T177844 ``` physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS,...
[14:48:47] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2038 - https://phabricator.wikimedia.org/T177844#3672201 (10Marostegui) This is being handled at: T177720
[14:49:00] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2038 - https://phabricator.wikimedia.org/T177844#3672215 (10Marostegui)
[14:49:02] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3667836 (10Marostegui)
[14:49:45] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3667836 (10Marostegui) 05duplicate>03Open
[14:50:01] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3667836 (10Marostegui)
[14:50:03] 10DBA, 10Operations, 10ops-codfw: Degraded RAID on db2038 - https://phabricator.wikimedia.org/T177844#3672201 (10Marostegui)
[14:52:45] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672225 (10Marostegui) @Papaul let us know if you were able to find disks to replace the (now) broken one and the one that will soon fail. Thanks!
[14:53:34] the old es server may have the same disk, if not, probably db2028,29
[14:54:09] db2010 does
[14:55:22] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672230 (10Marostegui) @Papaul db2010 which is scheduled for decommissioning (T175685) has the same chassis, so maybe it also has the same disks?
[14:55:47] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672232 (10Papaul) @Marostegui I have some 600GB 15k that I can pull out off db2025. Just keep in mind that those are Dell disks
[14:57:03] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672234 (10Marostegui) If db2025 is decommissioned, I would say let's go ahead...
[14:57:19] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3672236 (10Cmjohnson) The part is backordered, I will update ticket as soon I see it's shipped.
[14:57:51] 10DBA, 10Operations, 10ops-eqiad: Degraded RAID on db1092 - https://phabricator.wikimedia.org/T177264#3672242 (10Marostegui) Thank you!
[15:00:02] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672250 (10Marostegui) Btw, let's change just one disk at the time.
[15:00:45] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672264 (10Papaul) ok I i will replaced first slot 1
[15:01:17] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672266 (10Marostegui) Sounds good - thank you
[15:04:35] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672273 (10Papaul) Complete. Let me know when ready for slot 7
[15:05:24] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672276 (10Marostegui) Thanks, RAID rebuilding now: ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 1% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK)...
[15:08:11] BTW, s4 replication performance improved considerably in the last day
[15:08:24] see replication graphs for dbstore1001
[15:09:38] why would that be?
[15:10:20] less rc writes?
[15:10:44] ah yeah, indeed
[15:11:46] fucking lol
[15:11:57] that's kinda depressing
[15:12:14] well, it is not that visible
[15:12:26] but we had some nodes 1 week behind
[15:12:32] so it is a lot less iops
[15:12:40] for them
[15:14:35] see -ops recovery
[15:15:10] <3
[15:18:03] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3667836 (10Marostegui) @Papaul the rebuild for that disk has failed - can we try another spare disk maybe?
[15:18:56] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672357 (10Papaul) ok
[15:21:56] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672364 (10Papaul) done
[15:22:39] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672365 (10Marostegui) Here we go again: ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 0% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 600 GB, OK) physica...
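The controller output pasted into the task above follows a regular format (lines like `logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 1% complete)`), so rebuild progress can be extracted mechanically. A small parsing sketch; the field names in the returned dict are my own:

```python
import re

# Parse HP controller logicaldrive status lines like those quoted in
# the task, e.g.:
#   logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 1% complete)
#   logicaldrive 1 (3.3 TB, RAID 1+0, OK)
LD_RE = re.compile(
    r"logicaldrive\s+(\d+)\s+\(([^,]+),\s*([^,]+),\s*(\w+)"
    r"(?:,\s*(\d+)% complete)?\)"
)

def parse_logicaldrive(line):
    m = LD_RE.search(line)
    if not m:
        return None
    drive, size, raid, status, pct = m.groups()
    return {
        "drive": int(drive),
        "size": size.strip(),
        "raid": raid.strip(),
        "status": status,                       # e.g. OK / Recovering
        "percent": int(pct) if pct else None,   # rebuild progress, if any
    }
```

A loop over the output of the controller CLI could then alert once the status flips back to OK, instead of re-pasting the raw output into the task by hand.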
[15:25:40] I've upgraded 10.1.28 for jessie
[15:25:49] *uploaded
[15:34:21] Reedy, it is early to say, but it seems there are no longer insert spikes of 100x the normal rate: https://grafana.wikimedia.org/dashboard/db/mysql?panelId=2&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1068&var-port=9104&from=now-7d&to=now
[15:34:31] :)
[15:50:30] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3672446 (10Papaul)
[15:51:32] db2051 getting some lag
[15:51:59] I may have to use it as our slowest slave
[15:53:13] I think that works better, but the lagged purge will be even slower :-(
[15:55:09] probably too slow :-/
[15:55:35] 10DBA, 10Operations, 10ops-codfw, 10Patch-For-Review: Decommission db2010 and move m1 codfw to db2078 - https://phabricator.wikimedia.org/T175685#3672457 (10Papaul) a:05Papaul>03RobH
[15:56:00] I think I am going to disable checking cross-dc and not wait for replication on codfw
[15:58:07] I am just going to do that
[16:08:37] 10DBA, 10MediaWiki-Watchlist, 10Wikidata: Purge 90% of rows from recentchanges (and posibly defragment) from commonswiki and ruwiki (the ones with source:wikidata) - https://phabricator.wikimedia.org/T177772#3672475 (10jcrespo) I have allowed for codfw to lag- so that we can go at around 500 deletes/s. That...
[16:12:50] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3672483 (10hoo) I'll enable the tracking on the two wikis (`cawiki`, `cewiki`) tomorrow then. Btw, the estimate for k...
[16:29:57] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3672577 (10jcrespo) 51.5M you meant, maybe?
[16:34:21] 10DBA, 10MediaWiki-extensions-WikibaseClient, 10Wikidata, 10Patch-For-Review, and 2 others: Usage tracking: record which statement group is used - https://phabricator.wikimedia.org/T151717#3672617 (10hoo) >>! In T151717#3672577, @jcrespo wrote: > 51.5M you meant, maybe? No for kowiki we're indeed talking...
[16:53:41] 1380000 rows deleted so far in less than 1 hour
[16:53:58] wow that is pretty good
[16:54:19] the deletes are easy, it is the defrag that will take more involvement
[16:58:09] disk space consumption seems ok
[17:11:06] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3672802 (10Framawiki)
[17:20:09] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3672822 (10Marostegui) @Papaul the disk went fine, can you change the other one pending now? ``` logicaldrive 1 (3.3 TB, RAID 1+0, OK) physicaldrive 1I:1:1 (port 1I:box 1:bay 1,...
[17:28:07] 10DBA, 10Operations, 10ops-codfw: db2038 two disks with predictive failure - https://phabricator.wikimedia.org/T177720#3667836 (10Marostegui) I can see it is rebuilding now - thanks! ``` logicaldrive 1 (3.3 TB, RAID 1+0, Recovering, 0% complete) physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 6...
[17:38:12] I will stop the purge process later
[17:38:56] cool
[17:38:57] thanks
[18:33:31] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3673060 (10Lydia_Pintscher) This is a hugely political issue....
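As a sanity check on the numbers above: 1,380,000 rows in just under an hour is a rate of at least ~383 deletes/s, so at that pace the remaining rows (out of the ~6 million estimate) need roughly 3.3 more hours. The elapsed time below is taken as a full hour, so the computed rate is a lower bound:

```python
# Back-of-the-envelope check of the purge rate reported above.
rows_deleted = 1_380_000
elapsed_s = 3600                         # "less than 1 hour": upper bound
rate = rows_deleted / elapsed_s          # ~383 rows/s, a lower bound

target_rows = 6_000_000                  # total rows to purge, per the estimate
hours_remaining = (target_rows - rows_deleted) / rate / 3600   # ~3.3 h
```

This is consistent with the ~500 deletes/s figure from the task once codfw was allowed to lag, and with the later correction that the original "less than 3 hours" total was a slip.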
[20:30:51] 10DBA, 10Wikidata: Migrate wb_terms to using prefixed entity IDs instead of numeric IDs - https://phabricator.wikimedia.org/T114903#3673756 (10Ladsgroup)
[20:50:10] 10DBA, 10Commons, 10Contributors-Team, 10MediaWiki-Watchlist, and 12 others: "Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis - https://phabricator.wikimedia.org/T171027#3673815 (10Risker) >>! In T171027#3673060, @Lydia_Pintscher w...
[22:06:44] 10DBA, 10MediaWiki-Watchlist, 10Wikidata: Purge 90% of rows from recentchanges (and posibly defragment) from commonswiki and ruwiki (the ones with source:wikidata) - https://phabricator.wikimedia.org/T177772#3674015 (10jcrespo) > That means the whole thing will take less than 3 hours I had a mind slip... we...