[02:23:06] fundraising-tech-ops, Operations, netops, Patch-For-Review: Move codfw frack to new infra - https://phabricator.wikimedia.org/T171970#3526886 (ayounsi) some answers from Juniper about the other issues noticed: - Presence of core dumps ``` /var/crash/corefiles: total blocks: 70484 -rw-r--r-- 1 r... [15:35:04] (PS1) Ejegg: Position hosted checkout iframe [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/372157 (https://phabricator.wikimedia.org/T171346) [15:50:15] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1366 [15:53:13] sigh [15:55:15] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1371 [15:55:15] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1606 [16:00:15] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1547 [16:00:16] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2498 [16:00:16] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1815 [16:01:30] failmail party happening in my inbox [16:03:48] cwd do you need me to engage on this ? 
[16:05:15] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2032 [16:05:15] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2571 [16:05:16] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1722 [16:05:21] sigh [16:05:28] again [16:05:40] i've been working with the DBAs trying to figure out the actual problem [16:05:57] eileen1: but yeah we should kill the procs [16:06:12] dstrine: yeah i get an sms for each one too :) [16:06:42] cwd oh noes [16:06:46] good luck! [16:07:37] eileen1: anything non-standard going on? [16:10:15] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2276 [16:10:17] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2525 [16:10:17] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1928 [16:10:55] there must be [16:11:07] this looks relatively symmetrical between the servers [16:11:14] must be an extra large data set [16:12:53] cwd: we were running the weekly Big English test for the past hour.
It just ended [16:13:05] Don't think we've seen any problems with previous tests though [16:13:08] aaah [16:13:22] i feel like last time there were english banners we saw this [16:15:15] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2463 [16:15:15] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2096 [16:15:16] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2577 [16:17:27] i think what we're gonna have to do is disable certain jobs when english banners are up [16:18:50] cwd like what? [16:19:25] ACKNOWLEDGEMENT - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2463 Casey Dentinger english banners cause replag [16:20:05] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2297 [16:20:15] ACKNOWLEDGEMENT - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2577 Casey Dentinger english banners cause replag [16:20:58] ACKNOWLEDGEMENT - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2297 Casey Dentinger english banners cause replag [16:21:46] ejegg: ah i guess it's just consume and thank [16:22:14] i don't know of a good solution to this unless we can speed that job up [16:22:21] i.e. make it less db intensive [16:23:48] the thing is bandaiding over this is not a good idea, if the db master was to go down right now we would be looking at an hour of permanent data loss [16:23:53] an extremely busy hour [16:24:09] the only safe way to handle it is to not cause replag [16:24:12] cwd but why are the new dbs having so much more trouble with the job than the old dbs? [16:24:32] i'm not sure what you mean? 
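cwd's suggestion above — disabling certain jobs while English banners are up — amounts to a lag-aware job guard. A minimal sketch in Python; the job names and threshold are assumptions for illustration, not the real process-control configuration:

```python
from typing import Optional

# Hypothetical: lag-sensitive jobs and the lag cutoff are made-up values,
# not the actual fundraising process-control settings.
HEAVY_JOBS = {"consume_and_thank", "dedupe"}
MAX_LAG_SECONDS = 300


def should_run_job(job_name: str, seconds_behind_master: Optional[int]) -> bool:
    """Skip heavy jobs when any replica is too far behind the master.

    seconds_behind_master is None when replication is broken entirely;
    treat that as 'too far behind' for heavy jobs.
    """
    if job_name not in HEAVY_JOBS:
        return True
    if seconds_behind_master is None:
        return False
    return seconds_behind_master <= MAX_LAG_SECONDS
```

The point of the guard matches the reasoning that follows: it trades job latency for a bounded replication window, so a master failure never loses an hour of unreplicated writes.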
[16:24:34] or did we always have the replag and just no alarms? [16:24:45] the dbs are constantly getting bigger and slower [16:24:49] and the jobs more complex [16:25:22] the jobs haven't gotten any more complex since last year's Big English campaign [16:25:32] we already had the new financial tables in for that one [16:26:00] and Civi upgrades since then haven't changed the inserts significantly [16:26:32] Is it that we upgraded the main server massively, so it can handle a lot faster inserts? [16:26:38] but the replicated ones can't? [16:26:40] well we can try to get the DBAs involved to determine the specific cause [16:27:02] that is possible [16:27:42] 1002 looks about as big as 1001 [16:28:03] but replication is different than the way the master runs queries [16:28:28] Can we get an idea for the number of writes they SHOULD be able to handle? [16:28:34] not sure what the unit would be for that [16:28:51] rows/sec of a certain size, with certain indexes, etc? [16:28:54] it really depends on what you are doing [16:29:36] we could look at switching to row based replication [16:29:39] yeah, we just have to isolate whether this is anything app-specific, or whether it can be fixed in db-level config [16:29:53] and since the dbs have changed a hell of a lot more than the app since the last big campaign [16:29:59] i'd first suspect them [16:30:22] what has changed so much about the dbs? [16:30:42] they're on totally new machines now, right? [16:30:52] yeah, if anything they would be faster [16:30:54] different network hardware, etc [16:31:03] right, as long as all the config is optimised right [16:31:09] but it has in fact changed [16:31:22] and we are in fact seeing replication lag issues [16:31:37] that weren't there before [16:32:05] when we switched over, did we do any kind of stress testing like a big english test?
[16:32:16] ejegg: cwd I think during big english we turned off the dedupe job for the first few days, but actually I think it was on when we had traffic similar to what we would have seen today [16:32:38] can we put something in the calendar for next week to talk about how we can stress test? [16:33:10] heh, i'd say these 1 hr tests are pretty good stressors [16:33:53] there is also the nature of the fundraising [16:33:59] we are showing more banners more often than ever [16:40:25] anyway i have spent a lot of time trying to narrow down the cause and i will spend a lot more, it is unfortunately very non-obvious what things make dbs slow [16:43:38] Fundraising-Backlog, Wikimedia-Fundraising-Banners: Safari issue with Other amount option in banner: mask currency code letters - https://phabricator.wikimedia.org/T173431#3528117 (MBeat33) [16:48:01] (CR) Mepps: "Hashar, thanks for the review! Please see one question in reply to your comment. Patch with phpcs.xml about to be posted." (1 comment) [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [16:48:24] (PS3) Mepps: WIP Add phpcs script but times out on test [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) [16:49:01] hey ejegg! [16:49:25] hi mepps! [16:49:30] how was the road trip? [16:49:32] eileen1 I was almost worried you were on this early but then I remembered you're not in new zealand [16:49:54] really great ejegg! except a couple nervous moments haha during encounters with law enforcement [16:50:04] but we made it safe and sound [16:50:06] oh jeez, border crossings are never fun [16:50:17] mepps: yep, still Montreal [16:50:27] yeah i'll tell the story in standup if folks are interested, it's just funny [16:50:27] I leave tomorrow [16:50:44] eileen, cool! 
yeah not too early in canada :) [16:51:19] I'll probably not do stand up - not really working today (although have looked at translation issues with bgm just now - mostly I played rubber duck) [16:51:19] ejegg been catching up on emails/finishing phpcs task and now trying to regroup back to ingenico/orphan adapter work [16:51:38] i actually pinged you to remember where we were in that work and as i chatted i remembered orphan adapter [16:51:40] eileen1: can you estimate how much the absolute size of the db has increased this year? [16:51:52] eileen1 got it! hope you're having fun up there [16:51:54] hmm - good question - [16:52:08] so the silverpop data is around 24 GB [16:52:20] wow [16:52:23] my guess is the rest of the data would have increased by more than that all up [16:53:01] one factor is that all the db boxes are sitting at close to max memory usage all the time [16:53:07] they each have 512GB ram [16:53:08] (silverpop data is many many rows but each row is v small - other data is say 2 mill contacts with each one having a row in contact, contribution, line item, etc) [16:53:22] hmm - so what affects DB ram usage? [16:53:44] I instantly thought about php loading too much in memory before processing but that is a different box I guess [16:53:44] well i am guessing performance goes to hell the moment you can't fit the whole db in ram [16:54:06] ah - but we can maybe push some tables in ram more than others? [16:54:13] so if we were hitting some sort of absolute size limit [16:54:17] ie. all the log_ tables are really simple writes [16:54:32] yeah, there must be some way to prioritize [16:54:39] (they never do an update or an insert with queries on other tables) [16:55:30] hello MBeat [16:55:35] there is also a bit of a slow select I think I saw sneak back ( a few seconds but needlessly & common, but I feel like that maybe wouldn't contribute?) [16:55:37] hi jessicarobell [16:56:13] We are ready to launch the Brazil campaign, ok for you that we enable?
also, heads up dstrine ejegg [16:56:23] out for a bit now - will say hello later [16:56:25] eileen1: yeah selects won't get replicated so it's probably fine [16:56:28] see ya! [16:56:32] totally, thanks for the heads-up [16:56:43] Cool! Thanks MBeat [16:56:54] All good from tech @dstrine [16:56:56] ? [16:58:01] jessicarobell can I have a link for one of the banners pls? :) [16:58:30] jessicarobell oh its Pats... not sure why my user is coming as guest?? [16:58:38] sure :) https://pt.wikipedia.org/wiki/Apple?banner=B1718_0816_ptBR_dsk_p1_lg_txt_cnt&force=1&country=BR [16:58:46] Hello Pats! [16:58:48] thanks :) [16:59:35] (PS4) Umherirrender: WIP Add phpcs script but times out on test [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [16:59:48] (CR) Umherirrender: "Fixed phpcs.xml" [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [17:00:16] @ejegg are you around or someone else from tech? :) [17:01:44] (CR) jerkins-bot: [V: -1] WIP Add phpcs script but times out on test [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [17:04:56] jessicarobell: hey... [17:05:01] hi jessicarobell ! [17:05:04] what's up? [17:05:56] Hey @ejegg & @AndyRussG just double checking that we are fine to launch the Brazil campaign? (AstroPay) [17:06:01] We are ready to push the button :) [17:07:19] Ah ejegg is definitely more able to answer that than I [17:08:50] Ok, I'll wait for ejegg :) [17:09:22] the db replag is still growing [17:10:19] ejegg: are the queues empty? [17:10:30] (CR) Umherirrender: "It is now running:" [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [17:12:35] jessicarobell: can you wait?
i am trying to resolve some problems with the database [17:13:35] jessicarobell: everything specific to the d*local integration looks OK to me, so as soon as cwd is happy with the dbs I'm fine launching [17:13:47] cwd looking at queues now [17:13:53] the replag is over 5000s now [17:14:00] (PS2) Ejegg: Position hosted checkout iframe [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/372157 (https://phabricator.wikimedia.org/T171346) [17:14:02] (CR) Mepps: "Umherirrender currently running composer fix. Thanks!" [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [17:14:19] oh jeez, that's a lot of seconds [17:15:41] the past behavior has been weird though, seconds behind gets huge but then it catches up suddenly [17:16:23] cwd last queue consumer run was just 1.5 seconds [17:16:28] asking the dbas, we poked at this for hours last time and didn't really get anywhere [17:16:32] so yeah, I guess that queue is emptied [17:16:38] ok sure @cwd & @ejegg [17:16:50] ty jessicarobell, sorry for the inconvenience [17:16:58] ringing @pcoombe for info if you are around [17:17:04] my best guess is: https://ganglia.wikimedia.org/latest/graph.php?h=frdb1002.frack.eqiad.wmnet&m=cpu_report&r=4hr&s=by+name&hc=4&mc=2&st=1502902078&g=mem_report&z=medium&c=Fundraising+eqiad&_=1502903293781 [17:17:21] i notice it has some free memory until the english banners [17:17:25] and has been pegged since [17:17:41] cwd is that the whole civi db cached in ram? [17:17:44] (PS5) Mepps: WIP Add phpcs script but times out on test [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) [17:18:19] not sure, but probably [17:18:26] whatever it is querying heavily [17:18:29] oh wait, green is 'cache' [17:18:42] it's deceptively named, it's pretty much the same thing and used [17:18:51] *as used [17:18:54] does that mean disk cache is 98GB?
[17:19:09] and the 400GB is application-level caching? [17:19:15] (CR) Mepps: "Please note WIP. I'm still reviewing changes from composer fix." (1 comment) [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [17:19:17] https://serverfault.com/questions/85470/meaning-of-the-buffers-cache-line-in-the-output-of-free [17:19:53] hmm [17:20:05] ok, looking at what else is running [17:20:07] it's lower level than application caching [17:20:17] has our mail loader caught up to now yet? [17:20:36] i think so but not positive [17:20:57] huh, well, the recipient load doesn't run at this time of day [17:21:14] and the last time it did run, it took 38 seconds [17:21:37] which is probably the amount of time it takes to make the setup api calls, then wait, then check the FTP site for new stuff [17:22:22] cwd: and ejegg is what you are discussing right now posing any risk to starting a campaign? I'm just trying to get context. [17:22:22] i am seeing some dedupe fail mails too [17:22:52] the 'mailing load' job is also super quick lately [17:22:53] dstrine: yes, the context is that our db copies are having trouble catching up to the master [17:23:08] if the master server was to go down, we would lose all the data they haven't replicated [17:23:16] ok thanks cwd [17:23:34] *why* they are not able to catch up is the million dollar question [17:23:39] fr-tech any news for scrum of scrums? [17:23:52] dun think so [17:24:55] should I mention we might want more dba help? [17:25:19] it's ok, they know [17:25:23] k [17:25:31] they are just super swamped [17:26:01] i'm going to restart mysql on frdev1001 and just see if that fixes it [17:27:01] Sorry, I keep forgetting, are we doing slow query logging all the time? [17:27:02] heh turn it off and on :P [17:27:16] hehe, yep [17:28:02] ah now that ccogdill is here... have you sent any emails for brazil yet? [17:28:08] tomorrow [17:28:29] ok cool.
Things are a little interesting today as you can see ^^^ [17:29:03] cwd and ejegg - will you please let spatton know when we have green light to launch Brazil and also Malaysia and South Africa Mobile banners? I have to jump off for a bit [17:29:18] I do see! [17:29:28] will watch in case they continue to be interesting tomorrow [17:29:40] jessicarobell: will do [17:29:51] wonderful! Thanks all! [17:30:32] well it's catching up now [17:30:35] frdev1001 [17:30:45] the others are still getting worse [17:30:50] huh, like way faster than 5000 sec? [17:30:58] yeah [17:31:11] should be caught up in a couple minutes [17:31:17] restarting mysql is the worst fix ever though [17:31:17] weird [17:32:38] this could be a straight up bug in replication [17:35:15] RECOVERY - check_mysql on frdev1001 is OK: Uptime: 381 Threads: 2 Questions: 1923633 Slow queries: 10 Opens: 270 Flush tables: 1 Open tables: 237 Queries per second avg: 5048.905 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [17:36:30] ejegg|mtg: nothing here, thx!! [17:41:54] XenoRyet did you ever get to look at: https://gerrit.wikimedia.org/r/#/c/370225/ [17:49:52] cwd all clear for the campaigns to start up? [17:50:54] the other 2 slaves are still getting worse [17:51:07] restart 'em both too? [17:51:15] it would be really good to figure out what's going on here and the only chance is when they are broken [17:51:23] We're not dumping any more data into civi as far as I can tell [17:52:15] ok, what can we do to investigate? [17:52:50] ejegg do you know if there's a convention around writing tests for drush scripts? 
[17:52:59] my belief at this point is that there is a memory leak in the slave thread [17:53:10] i am trying to find any related information [17:53:42] mepps I think best would be to make the drush-specific code as small as possible (just get options) [17:53:55] then put all the real logic into a function that can be tested [17:54:05] like in an actual class [17:54:15] there are tons of instances of this in mysql's past [17:54:26] ejegg cool, that makes sense and i think is close to what i'm doing [17:54:28] and with dependencies provided in constructor [17:54:31] mepps great! [17:54:43] but gotta write the tests :) [17:54:54] sorry, or whatever kind of dependency injection makes the most sense [17:55:07] mepps cool [17:55:42] so, you'll want to use the smashpig TestingConfiguration stuff [17:56:01] that'll set up a pending DB in a sqlite in-memory instance [17:56:25] then you can populate that with your testing data [17:57:53] if the rectifier is writing successfully rectified things to queues, the SmashPig TestingConfiguration will also set those up with similar backing stores [17:58:01] so you can test what gets sent [17:58:25] cwd so what can we do to look for memory leaks? [17:58:50] search the internet for bugs i guess [17:58:52] cwd can we dump a snapshot to examine offline? [17:58:56] is what i'm trying to do [17:59:04] snapshot of what? [17:59:15] the current state of the machine [17:59:44] so we can restart the real thing and start the campaign [18:01:23] i mean, no there's no way to do that [18:01:26] eh, i guess it's not a setup you can test fixes on without replicating the state of the whole network [18:01:30] restarting mysql is just kicking the can down the road [18:01:39] and we will hit this again soon [18:01:48] but if the banners are the priority i guess i'll do that [18:03:21] would it help to set up some more monitoring / logging before next wednesday's big english test?
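ejegg's testing advice above (keep the drush wrapper thin, put the real logic in a class, inject dependencies via the constructor) is a standard pattern. A sketch in Python for illustration only — the real code is PHP/drush, and every class and method name here is hypothetical:

```python
# Hypothetical illustration of the pattern discussed above: CLI-specific
# code stays trivial; testable logic lives in a class whose dependencies
# (pending DB, queue) are injected, so tests can swap in fakes.


class FakePendingDb:
    """Stand-in for the pending-transaction store; tests inject this
    instead of a real database connection."""

    def __init__(self, orphans):
        self._orphans = orphans

    def fetch_orphans(self):
        return list(self._orphans)


class OrphanRectifier:
    """All real logic lives here; the drush entry point would only
    parse options, build real dependencies, and call rectify()."""

    def __init__(self, pending_db, queue):
        self.pending_db = pending_db  # real DB in prod, FakePendingDb in tests
        self.queue = queue            # real queue in prod, a plain list in tests

    def rectify(self):
        rectified = 0
        for message in self.pending_db.fetch_orphans():
            self.queue.append(message)
            rectified += 1
        return rectified
```

In the SmashPig setup described above, the fakes would instead be the in-memory sqlite pending DB and queue backing stores that TestingConfiguration wires up.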
[18:03:41] hmmm [18:04:28] well, let me dump some stats anyway and i'll restart the daemons [18:04:51] i think the logging/monitoring is pretty thorough [18:05:15] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 698395 Threads: 1 Questions: 20345006 Slow queries: 3856 Opens: 12097 Flush tables: 1 Open tables: 596 Queries per second avg: 29.131 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [18:05:43] weird [18:05:46] i didn't touch that [18:07:20] huh [18:07:32] i'm going to make a quick store run while traffic's light [18:07:53] cwd would you mind emailing jessicarobell when you're feeling OK about the dbs? [18:08:05] sure [18:08:10] thanks! [18:08:13] np [18:10:22] (CR) Umherirrender: "Big change, hard to review" (25 comments) [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [18:11:17] mepps: I'm only half-here right now, but yea I took a look and it seems like a good direction to me. [18:11:45] It's still marked WIP, so I didn't give it the full code review, but it looks good. [18:11:48] XenoRyet, thanks! 
sounds good, yeah you have other things to focus on :) [18:22:00] hey cwd I'm actually here whenever you can give us the green light to enable our new campaigns [18:24:07] spatton: cool, should be shortly [18:33:53] ugh and of course we don't have codfw monitoring [18:34:27] https://ganglia.wikimedia.org/latest/?r=2hr&cs=&ce=&m=cpu_report&c=Fundraising+eqiad&h=frdev1001.frack.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=small&metric_group=NOGROUPS [18:34:34] sure looks like a memory leak to me [18:39:08] permalink https://ganglia.wikimedia.org/latest/graph.php?h=frdev1001.frack.eqiad.wmnet&m=cpu_report&r=custom&s=by%20name&hc=4&mc=2&cs=08%2F16%2F2017%2016%3A30&ce=08%2F16%2F2017%2018%3A30&st=1502908611&g=mem_report&z=medium&c=Fundraising%20eqiad [18:40:26] i'm not sure what to say at this point, restarting mysql whenever we are going to run banners is not a solution [18:51:55] * cwd relaxes to soothing crickets [19:04:28] (PS2) AndyRussG: [WIP, pls. don't merge] Clone campaign feature [extensions/CentralNotice] - https://gerrit.wikimedia.org/r/371125 [19:12:53] (PS6) Mepps: WIP Add phpcs script but times out on test [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) [19:13:58] (CR) Mepps: "Good call Umherrinder, will work on removing the array syntax switch to simplify this patch." (1 comment) [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [19:14:11] (CR) Ejegg: "you might want to look at what I ended up doing in the SmashPig phpcs patch. I added a ton of exceptions to the phpcs.xml file for starter" [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [19:15:15] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 7029 [19:16:24] ah ha [19:16:44] ah ha?
[19:16:45] or...what's going on there? [19:16:55] that box recovered on its own, and now un-recovered [19:17:13] extremely un-recovered, huh? [19:17:33] yeah [19:17:56] ejegg: https://ganglia.wikimedia.org/latest/?r=2hr&cs=&ce=&m=cpu_report&c=Fundraising+eqiad&h=frdev1001.frack.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=small&metric_group=NOGROUPS [19:18:09] looks a lot like a memory leak or caching problem in mysql [19:20:15] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 4600 [19:20:41] what is shredding the db at this point?? [19:21:19] the saddest part about this is that the most interestingly behaving box (2001) doesn't have monitoring [19:25:15] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 703195 Threads: 2 Questions: 20682272 Slow queries: 3856 Opens: 12335 Flush tables: 1 Open tables: 596 Queries per second avg: 29.411 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [19:28:27] weirder still [19:28:40] jeez [19:29:04] could the 'seconds behind master' metric just be totally off? [19:31:43] or it just doesn't mean what it sounds like it means [19:32:09] ejegg: that ganglia graph shows when i restarted mysql [19:32:56] 2001 caught up all on its own [19:33:15] but then lagged again, i'm guessing whatever buffer is still full [19:33:23] but we don't have monitoring for codfw so who knows [19:41:27] cwd how do you feel about starting the Brazil campaign now? 
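The check_mysql SLOW_SLAVE alerts quoted throughout this log classify each replica by its thread state and Seconds_Behind_Master. A minimal sketch of that kind of thresholding; the warning/critical cutoffs are assumptions, since the real Icinga check's values don't appear in the log:

```python
from typing import Optional

# Assumed thresholds for illustration; the production check's values
# are not shown anywhere in this log.
WARN_SECONDS = 300
CRIT_SECONDS = 600


def classify_replica(slave_io: bool, slave_sql: bool,
                     seconds_behind: Optional[int]) -> str:
    """Map replication state to an Icinga-style status string."""
    if not (slave_io and slave_sql) or seconds_behind is None:
        # Replication threads stopped, or lag unknown: always critical.
        return "CRITICAL"
    if seconds_behind >= CRIT_SECONDS:
        return "CRITICAL"
    if seconds_behind >= WARN_SECONDS:
        return "WARNING"
    return "OK"
```

Note that both IO and SQL threads read "Yes" in every alert above: the replicas were healthy but falling behind, which is exactly why the lag number, not the thread state, carries the signal here.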
[19:42:26] well i feel like it's unwise to prioritize the banners over the stability of the system [19:43:11] https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=frdb1002.frack.eqiad.wmnet&m=cpu_report&r=year&s=by%20name&hc=4&mc=2&st=1502912172&g=mem_report&z=large&c=Fundraising%20eqiad [19:43:28] https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=frdb1001.frack.eqiad.wmnet&m=cpu_report&r=year&s=by%20name&hc=4&mc=2&st=1502912172&g=mem_report&z=large&c=Fundraising%20eqiad [19:43:38] ok so what happened in late june to cause all the db boxes to be out of memory since then [19:47:26] Fundraising Sprint Outie Inverter, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Patch-For-Review, Unplanned-Sprint-Work: Civi imports currently aren't working. - https://phabricator.wikimedia.org/T172918#3528863 (DStrine) Open>Resolved [19:47:54] Fundraising Sprint Lou Reed, Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising-Backlog, and 8 others: Update GC audit to read WX file - https://phabricator.wikimedia.org/T86090#3528867 (Ejegg) Open>Resolved [19:48:23] Fundraising Sprint Kickstopper, Fundraising Sprint Loose Lego Carpeting, Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, and 5 others: Resultswitchers: send straight to ty page on reload - https://phabricator.wikimedia.org/T167990#3528877 (Ejegg) Open>Resolved [19:48:25] Fundraising Sprint Kickstopper, Fundraising Sprint Loose Lego Carpeting, Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, and 4 others: PayPal EC dead session error - https://phabricator.wikimedia.org/T167923#3528878 (Ejegg) [19:49:10] Fundraising Sprint Baudelaire Bowdlerizer, Fundraising Sprint Costlier Alternative, Fundraising Sprint Deferential Equations, Fundraising Sprint English Cuisine, and 15 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3528888 (DStrine) @Pcoombe can yo... 
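One way to test the "working set no longer fits in RAM" theory raised above is the InnoDB buffer pool hit rate. A sketch using real MySQL/MariaDB status counter names (the numbers in the test are made up); a hit rate dropping well below ~0.99 during banner campaigns would support the theory:

```python
# Innodb_buffer_pool_read_requests counts logical page reads;
# Innodb_buffer_pool_reads counts the subset that had to go to disk.
# Both come from SHOW GLOBAL STATUS and arrive as strings.


def buffer_pool_hit_rate(status: dict) -> float:
    """Fraction of page reads served from the buffer pool rather than disk."""
    requests = int(status["Innodb_buffer_pool_read_requests"])
    disk_reads = int(status["Innodb_buffer_pool_reads"])
    if requests == 0:
        return 1.0  # no reads yet; treat as a perfect hit rate
    return 1.0 - disk_reads / requests
```

Sampling this before and during a Big English test would distinguish "the cache stopped fitting" from the memory-leak hypothesis discussed above.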
[19:49:11] Fundraising Sprint Outie Inverter, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: Allow use of Civimails - https://phabricator.wikimedia.org/T162747#3528890 (Ejegg) Open>Resolved [19:49:25] Fundraising Sprint Loose Lego Carpeting, Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, and 5 others: Implement Ingenico Connect API call to create hosted payment - https://phabricator.wikimedia.org/T163946#3528891 (Ejegg) Open>Reso... [19:50:15] Fundraising Sprint Judgement Suspenders, Fundraising Sprint Kickstopper, Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, and 5 others: retrieve lists of contacts who received a particular mailing - https://phabricator.wikimedia.org/T161762#3528921 (DStrine) @CCogdill_WMF I am... [19:50:23] Fundraising Sprint Judgement Suspenders, Fundraising Sprint Kickstopper, Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, and 5 others: retrieve lists of contacts who received a particular mailing - https://phabricator.wikimedia.org/T161762#3528922 (DStrine) Open>Resolv... [19:50:25] Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Epic, FR-Email: EPIC: Add key silverpop data into Civi - https://phabricator.wikimedia.org/T161757#3528923 (DStrine) [19:55:50] Fundraising-Backlog, FR-PayPal-ExpressCheckout: Paypal EC - send orphan-y donors to TY page - https://phabricator.wikimedia.org/T173456#3528947 (Ejegg) [19:59:52] just to throw something else out, and admitting that I'm not following all this TOO closely - we were planning a 1-hour Japanese pre-test tonight. Is that a problem? Should we hold it? CC ejegg and cwd [20:00:07] and we have MY & ZA banners up... 
[20:03:45] ah ha [20:03:49] Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, FR-Ingenico, FR-WMF-Audit: Ingenico WX audit parser: refunds missing gateway_txn_id - https://phabricator.wikimedia.org/T173457#3528972 (Ejegg) [20:06:15] Fundraising Sprint Baudelaire Bowdlerizer, Fundraising Sprint Costlier Alternative, Fundraising Sprint Deferential Equations, Fundraising Sprint English Cuisine, and 15 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3528988 (Pcoombe) Open>Res... [20:09:16] Fundraising-Backlog: dress rehearsal for MG event - https://phabricator.wikimedia.org/T173458#3528996 (DStrine) [20:10:32] ejegg, dstrine - are you all in a meeting? i'd be happy to pop in and talk about this [20:10:52] spatton: trying to come up with a reasonable solution [20:10:53] cwd yeah sprint planning [20:11:05] can you spare a few minutes? [20:27:46] Just want to reiterate a point pcoombe made in our FR slack channel, that we defer to your call on timing, cwd ; we can wait til tomorrow if that's the right course. [20:29:55] spatton: thanks! i'm going to execute a very hacky solution shortly but it should get us to a place where it's ok to drive traffic at payments [20:30:03] Fundraising Sprint p 2017, Fundraising-Backlog: dress rehearsal for MG event - https://phabricator.wikimedia.org/T173458#3529037 (DStrine) [20:30:05] Fundraising Sprint p 2017, Fundraising-Backlog, FR-PayPal-ExpressCheckout: Paypal EC - send orphan-y donors to TY page - https://phabricator.wikimedia.org/T173456#3529038 (DStrine) [20:30:06] sweet!
[20:30:07] Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog: MY users getting express checkout at some times - https://phabricator.wikimedia.org/T173334#3529039 (DStrine)
[20:30:09] Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, FR-Paypal, Patch-For-Review: PayPal audit parser shouldn't send duplicate recurring donations - https://phabricator.wikimedia.org/T172723#3529040 (DStrine)
[20:30:12] Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, and 2 others: Create orphan rectifier for PayPal Express Checkout - https://phabricator.wikimedia.org/T172202#3529042 (DStrine)
[20:30:15] Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, and 2 others: CentralNotice: JS selection widgets no longer work in interface to add a campaign - https://phabricator.wikimedia.org/T172023#3529043 (DStrine)
[20:30:17] Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, FR-Ingenico: Deal with recurring donations stuck in 'In Progress' status - https://phabricator.wikimedia.org/T171868#3529044 (DStrine)
[20:30:19] Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, and 2 others: Populate country column when creating c_t rows during offline import - https://phabricator.wikimedia.org/T171658#3529045 (DStrine)
[20:30:21] Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, Patch-For-Review: Strict error message in logs (quick tidy up) - https://phabricator.wikimedia.org/T171560#3529046 (DStrine)
[20:30:23] Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, and 3 others: Are PayPal refunds for recurring donations incorrectly being tagged as EC or vice versa? - https://phabricator.wikimedia.org/T171351#3529047 (DStrine)
[20:30:25] Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, and 2 others: Are we losing transactions witih repeated ct_id? - https://phabricator.wikimedia.org/T171349#3529048 (DStrine)
[20:30:27] Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, and 4 others: Optimize filesystem access in thank_you email job - https://phabricator.wikimedia.org/T170435#3529050 (DStrine)
[20:30:29] Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, and 2 others: Customize Ingenico hosted checkouts iframe - https://phabricator.wikimedia.org/T171346#3529049 (DStrine)
[20:30:32] Fundraising Sprint Kickstopper, Fundraising Sprint Loose Lego Carpeting, Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, and 4 others: process-control should not crash on bad utf-8 from stdout or stderr - https://phabricator.wikimedia.org/T167849#3529053 (DStrine)
[20:30:34] Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, MediaWiki-extensions-DonationInterface: Update Mastercard logo - https://phabricator.wikimedia.org/T166795#3529054 (DStrine)
[20:30:37] Fundraising Sprint Loose Lego Carpeting, Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, and 4 others: CentralNotice: Add controls to purge banner content in Varnish for a specific language - https://phabricator.wikimedia.org/T168673#3529052 (...
[20:30:39] Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, and 5 others: Implement Ingenico Connect API calls to get payment status - https://phabricator.wikimedia.org/T163948#3529055 (DStrine)
[20:30:41] Fundraising Sprint p 2017, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, FR-2016-17-Q2-Campaign-Support: Periodically run Civi contact import performance tests, note trends - https://phabricator.wikimedia.org/T146338#3529058 (DStrine)
[20:30:44] Fundraising Sprint Ivory Tower Defense Games, Fundraising Sprint Judgement Suspenders, Fundraising Sprint Kickstopper, Fundraising Sprint Loose Lego Carpeting, and 9 others: Move already-encapsulated useful classes into SmashPig - https://phabricator.wikimedia.org/T163868#3529056 (DStrine)
[20:30:46] Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, and 3 others: Import email-only contacts from 'remind me later' links into CiviCRM - https://phabricator.wikimedia.org/T160949#3529057 (DStrine)
[20:30:48] Fundraising Sprint p 2017, Fundraising-Analysis, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: Probably cause of failmail deadlocks - soft credit search pain - https://phabricator.wikimedia.org/T130068#3529059 (DStrine)
[20:30:51] Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, and 5 others: Merge CRM and DonationInterface queue wrappers - https://phabricator.wikimedia.org/T95647#3529060 (DStrine)
[20:30:54] Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, and 3 others: Clone button for CN campaigns - https://phabricator.wikimedia.org/T91078#3529061 (DStrine)
[20:51:08] pcoombe: spatton we should be good to go on banners
[20:51:25] I'll email this too just to be sure someone sees it.
[20:53:33] email sent
[20:59:56] thanks dstrine
[21:03:11] thanks much dstrine! thanks for the hard work cwd.
[21:07:20] heh, the hard work is yet to come :)
[21:07:32] :D
[21:33:20] BRL card transactions reaching Civi, TY emails going out, ahh
[21:52:44] ejegg: so seconds_behind_master just means how long since it's been in sync
[21:52:53] nothing to do with how far out of sync it is
[21:53:14] 2001 is lagging now
[21:53:24] we should see alerts shortly
[21:53:53] i do not, however, understand why seconds behind master goes down while it catches up
[21:54:28] cwd wait, that seems like the numbers would be more predictable if that was the interpretation
[21:55:10] oh right, the increase is indeed predictable
[21:56:55] 2001 caught up again
[21:57:02] frdev1001 never lagged
[21:58:04] and for the first time today it's going *down* on 1002
[21:58:28] Seconds_Behind_Master: 15642
[21:58:50] hmmm i just thought of something
[21:58:55] 1002 is the read server
[21:59:06] it could be locking excessively due to reads
[22:08:16] so we put banners up and now things are starting to catch up
[22:08:18] what the hell
[22:09:42] maybe it knocked the clog loose!!!
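[Editor's note: the confusion above about why Seconds_Behind_Master moves non-monotonically is common. Per the MySQL documentation, the value is the difference between the replica SQL thread's clock and the timestamp of the event it is currently applying, so it can jump as the thread works through old events. A minimal sketch of how one might parse the value from check output and gauge whether a replica is gaining ground; `parse_lag` and `catch_up_rate` are hypothetical helper names, and the sample figures are taken from the alerts in this log.]

```python
import re

def parse_lag(status_text):
    """Pull the Seconds_Behind_Master value out of SHOW SLAVE STATUS
    or Icinga check_mysql output (both spellings appear in the wild)."""
    m = re.search(r"Seconds[_ ]Behind[_ ]Master:\s*(\d+)", status_text)
    return int(m.group(1)) if m else None

def catch_up_rate(lag_then, lag_now, interval_s):
    """Seconds of lag worked off per wall-clock second between two samples.
    > 0 means the replica is gaining ground; -1 means the SQL thread is
    effectively stalled (lag grows one second per second)."""
    return (lag_then - lag_now) / interval_s

# The two frdb1002 alerts five minutes apart at the end of this log:
print(parse_lag("Seconds Behind Master: 1306"))      # 1306
print(round(catch_up_rate(1306, 1424, 300), 2))      # -0.39: falling behind,
                                                     # but not fully stalled
```

Sampling twice like this distinguishes a stalled SQL thread from one that is merely slower than the write load, which the raw lag number alone cannot do.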
[22:10:01] >:|
[22:14:04] hah, weird
[22:17:19] https://media.giphy.com/media/DJSLqKN3HQCek/giphy.gif
[22:17:55] the best part is i didn't even
[22:17:59] i didn't do anything
[22:18:36] except observe it
[22:18:49] now that's what i call quantum computing volume -INF
[22:20:03] scratch that, the lag is growing again
[22:20:14] PROBLEM - check_procs on frdb1002 is CRITICAL: PROCS CRITICAL: 1201 processes
[22:20:28] well that's a new one
[22:25:14] RECOVERY - check_procs on frdb1002 is OK: PROCS OK: 203 processes
[22:27:04] https://media.tenor.co/images/c942a1c40d796ca485960a2f12766acf/tenor.gif
[22:27:20] haha, yes
[22:30:14] RECOVERY - check_mysql on frdb1002 is OK: Uptime: 1419622 Threads: 2 Questions: 40706317 Slow queries: 8498 Opens: 24954 Flush tables: 1 Open tables: 606 Queries per second avg: 28.674 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 176
[22:31:00] first time that thing has caught up all day
[22:31:16] i saw a couple of dedupe fails just now with:
[22:31:19] DB Error: unknown error
[23:20:57] Fundraising-Backlog, fundraising-tech-ops: Frack replag master thread - https://phabricator.wikimedia.org/T173472#3529511 (cwdent)
[23:21:32] ok i'm going away-ish but will be checking in
[23:50:14] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1306
[23:51:33] oh good
[23:53:51] i am restarting mysql on that box
[23:54:00] * cwd is alert-fatigued
[23:55:14] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1424
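[Editor's note: the check_procs flap above follows the standard Nagios-style threshold convention (OK below warn, WARNING at or above warn, CRITICAL at or above crit). A minimal sketch of that comparison; the warn/crit values actually configured on frdb1002 are not shown in the log, so 250 and 1000 here are invented for illustration.]

```python
def check_procs(count, warn=250, crit=1000):
    """Classify a process count the way a Nagios-style plugin would.
    Thresholds here are placeholders, not the real frack config."""
    if count >= crit:
        return ("CRITICAL", f"PROCS CRITICAL: {count} processes")
    if count >= warn:
        return ("WARNING", f"PROCS WARNING: {count} processes")
    return ("OK", f"PROCS OK: {count} processes")

# The two frdb1002 readings from this log:
print(check_procs(1201))  # ('CRITICAL', 'PROCS CRITICAL: 1201 processes')
print(check_procs(203))   # ('OK', 'PROCS OK: 203 processes')
```

A jump from ~200 to ~1200 processes on a database host usually means connections piling up behind slow or lock-bound queries, consistent with the read-lock theory discussed earlier.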