[01:08:35] DBA, Data-Services: Missing data in database replicas - https://phabricator.wikimedia.org/T254193 (bd808) >>! In T254193#6183562, @AntiCompositeNumber wrote: > The issue seems to be roughly constant per day over the past month (https://quarry.wmflabs.org/query/45496). Doesn't really help figure out when...
[01:29:09] DBA, Data-Services: Missing data in database replicas - https://phabricator.wikimedia.org/T254193 (AntiCompositeNumber) Doh! That should be `rc_this_oldid` not `rc_cur_id`. `rc_cur_id` is a key to `page_id`, not `rev_id`. https://quarry.wmflabs.org/query/45496 and the other queries should be correct now...
[01:36:59] DBA, Data-Services: Missing data in database replicas - https://phabricator.wikimedia.org/T254193 (bd808) Focusing on the [[https://es.wikipedia.org/w/index.php?title=Nocem_Collado&action=history|Nocem_Collado]] page from P11348, I see this on the wiki replicas: `name=analytics cluster,lang=sql,lines=10...
[05:05:32] DBA, Patch-For-Review, Upstream, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) I have stopped mysql on db1141 to take a binary backup to it and it is being copied to: `backup1002:/srv/backups/T249188/ongoing/db...
[05:25:41] DBA, Cloud-Services, CPT Initiatives (MCR Schema Migration), Core Platform Team Workboards (Clinic Duty Team), Schema-change: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (Marostegui) >>! In T238966#6182065, @M...
[07:09:19] DBA, Operations, ops-eqiad, Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (Marostegui) @Jclark-ctr db1138 is now off and ready for you to change the memory whenever you get to the DC. Once you are done, please power the host back o...
[07:13:58] DBA, MediaWiki-General, TechCom-RFC, Performance-Team (Radar): RFC: Discourage use of MySQL's ENUM type - https://phabricator.wikimedia.org/T119173 (Nikerabbit) > NameTableStore, added in 1.31, provides a convenient way of handling pseudo-enums from the PHP side. As far as I can see, only core c...
[07:28:34] DBA, Data-Services: Missing data in database replicas - https://phabricator.wikimedia.org/T254193 (Marostegui) I just did a quick check on `eswiki` and `fawiki` and the number of rows is exactly the same in production as in sanitarium (master for labsdbhosts) and on labsdb1012 ` root@cumin1001:/home/ma...
[09:01:16] DBA, Data-Services: Missing data in database replicas - https://phabricator.wikimedia.org/T254193 (jcrespo) I did some deeper research and this is a misunderstanding of how recentchanges and revision tables work. While at first one could think that recentchanges is just a smaller version of the revision...
[09:24:52] DBA, MediaWiki-General, TechCom-RFC, Performance-Team (Radar): RFC: Discourage use of MySQL's ENUM type - https://phabricator.wikimedia.org/T119173 (Tgr) NameTableStore itself can be used just fine, it doesn't need any extension points. NameTableStoreFactory can only be used by core (that should...
[10:12:09] DBA: Productionize db114[1-9] - https://phabricator.wikimedia.org/T252512 (Marostegui)
[10:24:33] DBA, MediaWiki-General, TechCom-RFC, Performance-Team (Radar): RFC: Discourage use of MySQL's ENUM type - https://phabricator.wikimedia.org/T119173 (Ladsgroup) >>! In T119173#6183887, @Nikerabbit wrote: >> NameTableStore, added in 1.31, provides a convenient way of handling pseudo-enums from the...
[10:35:32] DBA: Drop wb_terms in production from s4 (commonswiki, testcommonswiki), s3 (testwikidatawiki), s8 (wikidatawiki) - https://phabricator.wikimedia.org/T248086 (JAllemandou)
[11:01:05] DBA, MediaWiki-General, TechCom-RFC, Performance-Team (Radar): RFC: Discourage use of MySQL's ENUM type - https://phabricator.wikimedia.org/T119173 (Ladsgroup) Clarification: I think we need to decide whether we need another table or hardcoded numbers in the code on a case-by-case basis but we shou...
[11:10:05] DBA, MediaWiki-General, TechCom-RFC, Performance-Team (Radar): RFC: Discourage use of MySQL's ENUM type - https://phabricator.wikimedia.org/T119173 (Nikerabbit) >>! In T119173#6184431, @Ladsgroup wrote: >>>! In T119173#6183887, @Nikerabbit wrote: >>> NameTableStore, added in 1.31, provides a conv...
[11:22:57] DBA, MediaWiki-General, TechCom-RFC, Performance-Team (Radar): RFC: Discourage use of MySQL's ENUM type - https://phabricator.wikimedia.org/T119173 (Ladsgroup) >>! In T119173#6184593, @Nikerabbit wrote: > I see, you are constructing the class directly. I wonder how this is going to work with the...
[11:47:06] bad news
[11:47:14] I stopped mysql on db1141 to copy it to backup1002
[11:47:23] and after the restart....crashes
[11:47:30] so it is not host specific
[11:47:45] and not data specific
[11:53:14] I will update the task and the email thread after lunch
[11:56:35] marostegui: oh dear
[11:57:46] it hasn't really crashed yet, but it is showing all the InnoDB errors we've seen before the crashes on labsdb1011
[11:58:08] so this sounds like a 10.4 bug?
[11:58:15] god knows
[11:58:20] these multisource hosts are super special
[11:58:28] lots of data, lots of queries, lots of lag, lots of killed queries...
[11:58:43] Going for lunch, will be back in a couple of hours or so
[12:08:01] DBA, Cloud-Services, CPT Initiatives (MCR Schema Migration), Core Platform Team Workboards (Clinic Duty Team), Schema-change: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (daniel) Thanks for testing this! > Th...
[12:44:50] DBA, Patch-For-Review, Upstream, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) Bad news, db1141 is showing InnoDB errors after the restart to copy the data to backup1002 before proceeding with labsdb1010. It ha...
[12:46:05] :-(
[12:46:16] DBA, Patch-For-Review, Upstream, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['labsdb1011.eqiad.wmnet'] ` The l...
[12:46:32] https://jynus.com/gif/dissappointed.gifv
[12:48:07] marostegui: are you ready for some more fun?
[12:48:37] cdanis: I thought it was at 14:00 utc!
[12:49:03] oh, I got it wrong
[12:49:50] i can do it now if you like too
[12:49:50] I could have gone back to sleep
[12:49:55] haha
[12:50:06] if you are up already, let's get this over with
[12:50:10] alright great
[12:50:50] * Reedy waits for marostegui to shout LEEERRRROOOOOOYYYYYYY
[12:52:21] no jenkins involving in this procedure
[12:52:44] will be when they merge the patch ;)
[12:53:02] *involved
[12:53:09] I am going to merge the mw-config patch now
[12:53:15] ok
[12:59:20] DBA, Cloud-Services, CPT Initiatives (MCR Schema Migration), Core Platform Team Workboards (Clinic Duty Team), and 2 others: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 (Marostegui) >>! In T238966#6184701, @daniel...
[13:04:21] ok
[13:04:28] I am ready
[13:04:58] do I understand things correctly that db1112 is the special host that serves contributions queries?
[13:05:23] on which wiki do you want to test?
[13:05:30] enwiki? :)
[13:05:34] then no
[13:05:47] we need to look at db1099:3311 and db1105:3311
[13:05:52] oh
[13:05:52] "DEFAULT": {
[13:05:54] "contributions": {
[13:05:56] "db1112": 100
[13:05:57] default == s3
[13:05:58] ohhh that's default not s1
[13:06:03] yep
[13:06:04] I am sending traffic
[13:06:10] ok, checking
[13:06:44] it does look like PoolCounter is working btw
[13:08:29] cdanis: I am not seeing anything strange so far
[13:08:50] I will increase the concurrency by sending from some more hosts, for different contribution lists
[13:08:56] cool
[13:11:14] the only thing I see that stands out so far is an increase in 'InnoDB Semaphores' on db1099
[13:11:20] same graph says no data on db1105
[13:11:52] adding some more load
[13:12:42] db traffic is increased but ofc you'd expect that
[13:12:48] yeah
[13:13:02] The servers are still coping well
[13:14:41] adding some more load
[13:14:46] go for it
[13:15:47] DBA, MediaWiki-General, TechCom-RFC, Performance-Team (Radar): RFC: Discourage use of MySQL's ENUM type - https://phabricator.wikimedia.org/T119173 (Tgr) Yeah, we also used it in WikimediaEditorTasks and MachineVision pre-policy. But ultimately, I think, the factory should be fixed. Filed {T25424...
[13:18:33] everything still good
[13:18:46] It's going to be slightly amusing if Tim was right - https://phabricator.wikimedia.org/T234450#5676385
[13:19:11] haha yes
[13:19:34] DBA, Patch-For-Review, Upstream, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['labsdb1011.eqiad.wmnet'] ` and were **ALL** successful.
[13:21:51] so I'm hammering several different high-edit-count users with a limit=5000 query, total concurrency 14
[13:22:19] cdanis: yeah, so far the DBs are seeing nothing strange
[13:22:40] I saw db1105 hanging a bit
[13:22:43] let me double check that
[13:24:08] sorry if I ask a stupid question, but are you running the same query that caused the issues?
[13:24:13] jynus: yes
[13:24:32] ok, I was mentioning it because there is high variability in rcs options
[13:24:48] and performance changes a lot from using some filters or others
[13:24:52] cdanis: I am seeing _some_ lag on db1105, but small
[13:24:56] like 1-2 seconds
[13:25:01] not constant, but often
[13:25:02] repl lag?
[13:25:05] yep
[13:25:26] shall I add a bit more load?
[13:25:27] when the contributions queries arrive
[13:26:17] cdanis: give it a bit more
[13:26:40] ok
[13:27:59] it might have been more 'realistic' to scrape a whole bunch of users, instead of the same several over and over again, but I didn't prepare for that
[13:28:23] I am seeing batches of MW waiting for replication
[13:28:32] where are you looking?
[13:28:37] db1105 and db1099
[13:29:11] hey, from labstore1004, we are getting `maintain-dbusers[59949]: pymysql.err.InternalError: (1130, "Host '10.64.37.19' is not allowed to connect to this MariaDB server")` this is contacting labsdb1010/1011
[13:29:13] it seems like there's been an overall increase on writes on enwiki, which I don't think my load is causing
[13:29:18] arturo: labsdb1011 is down
[13:29:41] \groundhogday{labsdb1011 is down}
[13:30:03] but 1012 and 1010 should be fine, no?
[13:30:09] arturo: yep
[13:30:33] I don't understand why it complains about the missing grant, then
[13:31:11] arturo: I cannot check now - will check later or tomorrow
[13:31:20] ok thanks
[13:31:28] arturo: is this urgent?
[13:31:46] I cannot tell right now how urgent this is
[13:31:52] not sure what it is impacting
[13:32:14] in any case, not enwiki-down-level urgent
[13:32:16] :-)
[13:32:19] :)
[13:33:17] looks like s1 writes are back down to baseline-ish
[13:33:52] cdanis: yeah, not sure if those small pauses were related to these tests
[13:33:56] they look better now
[13:34:57] how confident are you feeling?
[13:35:18] I think this looks okay
[13:35:52] I think it looks ok too, but it is hard to catch this
[13:35:58] as we saw it wasn't happening all the time
[13:36:03] just at specific times from a specific IP actually
[13:36:06] yeah
[13:36:22] there's some part where we won't know until it has been running this way for a while
[13:36:28] yeah
[13:44:34] we could try harder to break things, we could just leave it, we could put back the 500 limit... I'm inclined to do #2
[13:45:54] leave it meaning?
[13:46:28] sorry, I mean, continue (from now) allowing limit=5000 on contributions
[13:46:47] that was the limit we had originally, no?
[13:46:55] originally, pre-October, yeah
[13:46:58] (were there any other changes made since October that might have helped resilience here?)
[13:47:23] yeah, let's go for that
[13:47:27] but let's also be ready to revert
[13:47:32] if we see this happening again
[13:47:39] I would expect bots to keep trying
[13:48:22] DBA, Operations, ops-eqiad, Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (Jclark-ctr) @Marostegui Replaced failed DIMM. host is powered back on
[13:48:38] I will increase concurrency just a bit more, for a few minutes, and then call the test successful
[13:48:57] ok
[13:49:16] DBA, Operations, ops-eqiad, Wikimedia-Incident: db1138 (s4 master) crashed due to memory issues - https://phabricator.wikimedia.org/T253808 (Marostegui) Thank you, I will take it from here
[13:51:21] arturo: labsdb1011 is back up with the same grants as the rest of the labsdb hosts
[13:54:45] okay, I am going to stop traffic
[13:55:54] ok
[14:02:55] thanks for the help!
[14:03:32] * marostegui crosses his fingers
[14:17:28] DBA, DC-Ops, Operations, ops-eqiad: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['db1140.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/20200...
[14:38:58] DBA, DC-Ops, Operations, ops-eqiad, Patch-For-Review: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['db1140.eqiad.wmnet'] ` The log can be found in `/var/log/...
[14:39:22] hopefully successful at the 10th try
[14:42:39] XDD
[14:49:04] DBA: Productionize db213[6-9] and db2140 - https://phabricator.wikimedia.org/T252985 (Kormat)
[14:55:10] DBA, DC-Ops, Operations, ops-eqiad, Patch-For-Review: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (jcrespo) a:Jclark-ctr→jcrespo Jclark changed the serial port and we now have output on serial console, including post. The other issue was the mac address...
[14:57:18] DBA, DC-Ops, Operations, ops-eqiad, Patch-For-Review: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1140.eqiad.wmnet'] ` and were **ALL** successful.
[14:58:21] I mean, there was an exception on icinga, but I'll take it
[15:19:24] I will reprovision db1140 tomorrow to save some iops
[15:20:47] DBA, Data-Services: Confusion about the relationship between recentchanges and revision history (was: Missing data in database replicas) - https://phabricator.wikimedia.org/T254193 (bd808)
[15:22:46] DBA, Data-Services: Confusion about the relationship between recentchanges and revision history (was: Missing data in database replicas) - https://phabricator.wikimedia.org/T254193 (bd808) Open→Invalid Per @jcrespo and T254193#6184133
[15:28:15] DBA, Data-Services: Confusion about the relationship between recentchanges and revision history (was: Missing data in database replicas) - https://phabricator.wikimedia.org/T254193 (jcrespo) To summarize what happens: 1) Edits happen on page P 2) Page P is renamed to page Q, a redirect is left behind. A...
[16:04:07] DBA, Patch-For-Review, Upstream, cloud-services-team (Kanban): Reimage labsdb1011 to Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T249188 (Marostegui) myloader started on labsdb1011 with 18 threads.
[16:13:32] DBA, MediaWiki-General, TechCom-RFC, Performance-Team (Radar): RFC: Discourage use of MySQL's ENUM type - https://phabricator.wikimedia.org/T119173 (daniel) > I see, you are constructing the class directly. I wonder how this is going to work with the new [[ https://www.mediawiki.org/wiki/Stable_i...
[16:57:59] DBA, CheckUser, Trust-and-Safety, WMF-Legal, and 2 others: Set wgCheckUserLogLogins to true on WMF wikis to log successful and unsuccessful login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 (Huji) Adding all three groups. Their review would be the next step. As for Legal an...
[17:17:11] DBA, CheckUser, Trust-and-Safety, WMF-Legal, and 2 others: Set wgCheckUserLogLogins to true on WMF wikis to log successful and unsuccessful login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 (Reedy) I don't imagine it's too much of an issue, offhand. Sure, it'll create an in...
[18:02:11] DBA, CheckUser, Trust-and-Safety, WMF-Legal, and 2 others: Set wgCheckUserLogLogins to true on WMF wikis to log successful and unsuccessful login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 (Aklapper) >>! In T253802#6185700, @Huji wrote: > Adding all three groups. Please al...
[18:13:22] DBA, DC-Ops, Operations, ops-eqiad: db1140 (backup source) crashed - https://phabricator.wikimedia.org/T250602 (wiki_willy) Thanks @jcrespo , our documentation looks to be a bit outdated, so we'll get this added in >>! In T250602#6185325, @jcrespo wrote: > Jclark changed the serial port and we...
[18:48:18] DBA, CheckUser, Trust-and-Safety, WMF-Legal, and 2 others: Set wgCheckUserLogLogins to true on WMF wikis to log successful and unsuccessful login attempts in CheckUser - https://phabricator.wikimedia.org/T253802 (Huji) Thanks @Aklapper it seems @Reedy has kindly sent the email already, so I will...
[18:55:08] DBA, Data-Services: Confusion about the relationship between recentchanges and revision history (was: Missing data in database replicas) - https://phabricator.wikimedia.org/T254193 (-jem-) Thanks for your complete explanation, @jcrespo. If I have understood, edits previous to a page move can't be tracked...
[19:28:01] DBA, Data-Services: Confusion about the relationship between recentchanges and revision history (was: Missing data in database replicas) - https://phabricator.wikimedia.org/T254193 (jcrespo) @-jem-: I think using revision + log would be the best way for that kind of analysis. In any case, there is always...