[02:36:22] DBA, MediaWiki-Database, Operations, Performance-Team, and 2 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Krinkle) > The MediaWiki eqiad-appserver cluster **gasping for air**, | {F26543386 height=300} | //[figure 1.](https://grafana.wikimedia.org/dash...
[02:37:27] DBA, MediaWiki-Database, Operations, Performance-Team, and 2 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Krinkle)
[08:33:54] DBA, MediaWiki-Database, Operations, Performance-Team, and 2 others: parsercache used disk space increase - https://phabricator.wikimedia.org/T206740 (Banyek) The binlog purging on codfw was started yesterday (sorry I didn't logged it here), and it runs since; the replication works, and the disk...
[09:05:59] DBA, Operations, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache fiaso post-switchover - https://phabricator.wikimedia.org/T206841 (Joe)
[09:26:17] DBA, Operations, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache fiaso post-switchover - https://phabricator.wikimedia.org/T206841 (Joe) First, the timeline: - Internal traffic starts flowing through eqiad in the interval 14:14:44 - 14:15:03 - External traffic...
[09:47:15] DBA, Operations, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache fiaso post-switchover - https://phabricator.wikimedia.org/T206841 (akosiaris)
[09:49:00] DBA, Operations, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache fiasco post-switchover - https://phabricator.wikimedia.org/T206841 (Joe)
[09:51:36] DBA, Operations, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (Joe)
[10:28:34] I prepared some puppet code for putting db2096 in production, but the puppet catalog compiler can't test it, because of some facts are missing. Maybe there's a puppet first run needed or something like that, I'll leave it to Monday.
[10:30:44] btw. if somebody read this later: everything is quiet so far.
[10:32:07] banyek: it's a new host?
[10:32:16] yes it is
[10:32:37] then yes, we need to sync it's puppet facts to the compiler, but it requires that puppet run at least once on the host in prod
[10:32:40] https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet-diffs#FAQ
[10:32:48] s/it's/its/
[10:34:06] tx
[10:34:47] TL;DR it's not possible to run the compiler on a brand new host before its first puppet run, but if it was provisioned into the spare::system role then you can sync the facts
[10:34:51] and run the compiler
[10:35:10] if you sync the puppet facts please !log it for tracking purposes
[10:35:29] banyek: let me know if you need anything
[10:35:45] else
[10:36:06] I won't do that today, I am alone in the DBA team, I just preparing stuff, and keep the fire burning, don't touching anything :)
[10:36:30] But thank you the infos
[10:37:15] the compiler doesn't run in prod is a CI thing, so not critical at all, feel free to sync the facts if you need to
[10:38:34] I got a few more things to check on the change (from Manuel), so I'll fix those first, and then we'll see if I get to the puppet compiler again
[10:38:53] ack, no prob :)
[10:39:05] :)
[13:54:12] DBA, Operations, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (Joe) So first, what I think might be the full root cause of everything: When we switched from codfw to eqiad the parser cach...
[14:41:24] DBA, Operations, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (Joe) With memcached wiped clean, and the parsercache databases basically void of useful content, almost all requests needed...
[14:45:18] DBA, Operations, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (Joe) At the same time, a higher time for processing a single request meant that even in front of a substantially constant re...
[15:05:05] DBA, Operations, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (Joe) As far as MediaWiki fatals go, we had way less issues than one would expect given the graphs above. We had only ~ 1000...
[15:05:56] DBA, Operations, Datacenter-Switchover-2018, User-Joe: Evaluate the consequences of the parsercache being empty post-switchover - https://phabricator.wikimedia.org/T206841 (Joe) Overall the absence of any valid parsercache entries can explain all the effects we've seen, except at least partially...
[15:18:03] DBA, MediaWiki-extensions-WikibaseRepository, Wikidata, wikidata-tech-focus: wikibase: synchronize schema on production with what is created on install - https://phabricator.wikimedia.org/T85414 (WMDE-leszek) `wb_terms_entity_id` only uses the "old", numeric-only `term_entity_id` column, hence it...
[15:25:18] DBA, Lexicographical data, Wikidata, Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Lea_Lacroix_WMDE) User:Jason.nlw told me that during the effected period...
[16:08:37] DBA, Lexicographical data, Wikidata, Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Addshore) >>! In T206743#4661592, @Lea_Lacroix_WMDE wrote: > User:Jason....
[16:14:08] DBA, Lexicographical data, Wikidata, Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Pigsonthewing) I'm pretty sure I added an image: https://commons.wikime...
[16:21:32] DBA, Lexicographical data, Wikidata, Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Addshore) >>! In T206743#4661704, @Pigsonthewing wrote: > I'm pretty sur...
[16:22:13] ^^ evilness :(
[17:50:09] DBA, Lexicographical data, Wikidata, Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Addshore) Here is a list of all pages that will likely be affected alrea...
[17:59:39] DBA, Lexicographical data, Wikidata, Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Addshore) Which means there are ~9433 pages that now probably have the w...
[18:03:08] DBA, Lexicographical data, Wikidata, Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Addshore) It looks like we could fix these with the attachLatest.php mai...
[18:05:34] DBA, Lexicographical data, Wikidata, Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Marostegui) We are also fully restoring the eqiad hosts from codfw which...
[18:23:10] DBA, Lexicographical data, Wikidata, Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Marostegui) >>! In T206743#4661943, @Addshore wrote: > Which means there...
[18:24:48] DBA, Lexicographical data, Wikidata, Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Addshore) I'll run a maint script to fix the page_latest of the 9000ish...
[18:28:49] DBA, Lexicographical data, Wikidata, Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Marostegui) >>! In T206743#4662023, @Addshore wrote: > I'll run a maint...
[18:35:53] DBA, Lexicographical data, Wikidata, Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Addshore) >>! In T206743#4662033, @Marostegui wrote: >>>! In T206743#466...
[18:38:00] I put the today's log to our etherpad (Line 60-78)
[18:39:30] DBA, Lexicographical data, Wikidata, Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Addshore) Using the example that I had in T206743#4661943 the pages that...
[18:43:30] DBA, Lexicographical data, Wikidata, Wikidata-Campsite, and 5 others: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") - https://phabricator.wikimedia.org/T206743 (Marostegui) ``` root@neodymium:/home/marostegui# ./section s8 | while re...
[18:44:19] banyek|away: you deleted all the previous things? :(
[18:44:40] Ah no, it is right above
[18:45:01] yes, I just inserted it there
[18:45:05] banyek|away: I assume not all that will go to Monday's meeting etherpad, right? XD
[18:45:29] No, I just wanted to update you about my first day alone :)
[18:45:46] yeah
[18:45:52] There is a warning about db2050's disk
[18:45:55] did you see that?
[18:47:12] no :(
[18:47:49] And I didn't see it in the host alert history
[18:49:19] it happened yesterday 2018-10-11 12:50:43
[18:49:30] Which view do you use?
[18:50:11] I normally have this one always open (let me take a screenshot)
[18:50:20] reporting -> alert history
[18:51:00] Do you see the red box that says critical and has 3 numbers?
[18:51:18] click on the right one - that is the not ACK'ed CRITICALs
[18:51:25] I always have that one open
[18:51:41] And from time to time I click on the WARNING box, on the left one
[18:51:48] which is the non ack'ed WARNINGS
[18:52:00] < marostegui> click on the right one - that is the not ACK'ed CRITICALs -> not the right one, the most left one
[18:52:17] so it is good to keep the non ack'ed criticals always open
[18:52:23] and from time to time check the warnings
[18:52:34] ok
[18:52:40] do you see it?
[18:52:47] yes
[18:52:51] that looks good
[18:52:53] that is a good filtered one
[18:52:53] yeah
[18:52:57] the others are too noisy
[18:52:59] I mean that looks short enough
[18:53:03] jinx :D
[18:53:06] exactly
[18:53:14] so always keep the non acked criticals open
[18:53:18] and from time to time check the warnings
[18:53:30] Remember you'll be in charge of icinga next week!:)
[18:53:32] okay
[18:54:17] We _really_ have to get on with the schema change plan
[18:54:21] It has been too long already
[18:54:42] It is a good exercise to fully understand all the architecture
[18:55:15] which will be useful to debug during incidents, as you'll know the architecture and the way all the wikis work
[18:55:57] I know. I spent time on that too, but there were a few questions of yours which I couldn't answer
[18:56:51] So I wanted to ask about them, or do that with a less tired brain
[18:57:50] let's address them on monday
[18:59:01] anyways, I am off
[18:59:06] Have a good weekend
[18:59:07] o/
[18:59:15] addshore: thanks for all the help btw
[18:59:23] <3
[18:59:53] bye 👋
[19:03:20] banyek|away: I added a couple of comments to your notes, btw
[19:03:28] line 73 you might like :)
[19:04:46] Now off for real!
[19:04:48] Bye!
[19:06:09] marostegui: no :)
[19:06:11] *np
[19:06:26] It's been a fun week
[19:06:27] Hah
[21:34:11] Blocked-on-schema-change, DBA, Anti-Harassment, Patch-For-Review: Execute the schema change for Partial Blocks - https://phabricator.wikimedia.org/T204006 (kaldari) @Marostegui - I've heard from multiple people about unexpected fires delaying Ops/DBA work, but no additional information. Is there...