[02:23:06] fundraising-tech-ops, Operations, netops, Patch-For-Review: Move codfw frack to new infra - https://phabricator.wikimedia.org/T171970#3526886 (ayounsi) some answers from Juniper about the other issues noticed: - Presence of core dumps ``` /var/crash/corefiles: total blocks: 70484 -rw-r--r-- 1 r... [15:35:04] (PS1) Ejegg: Position hosted checkout iframe [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/372157 (https://phabricator.wikimedia.org/T171346) [15:50:15] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1366 [15:53:13] sigh [15:55:15] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1371 [15:55:15] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1606 [16:00:15] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1547 [16:00:16] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2498 [16:00:16] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1815 [16:01:30] failmail party happening in my inbox [16:03:48] cwd do you need me to engage on this ? 
[16:05:15] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2032 [16:05:15] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2571 [16:05:16] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1722 [16:05:21] sigh [16:05:28] again [16:05:40] i've been working with the DBAs trying to figure out the actual problem [16:05:57] eileen1: but yeah we should kill the procs [16:06:12] dstrine: yeah i get an sms for each one too :) [16:06:42] cwd oh noes [16:06:46] good luck! [16:07:37] eileen1: anything non-standard going on? [16:10:15] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2276 [16:10:17] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2525 [16:10:17] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1928 [16:10:55] there must be [16:11:07] this looks relatively symmetrical between the servers [16:11:14] must be an extra large data set [16:12:53] cwd: we were running the weekly Big English test for the past hour.
It just ended [16:13:05] Don't think we've seen any problems with previous tests though [16:13:08] aaah [16:13:22] i feel like last time there were english banners we saw this [16:15:15] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2463 [16:15:15] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2096 [16:15:16] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2577 [16:17:27] i think what we're gonna have to do is disable certain jobs when english banners are up [16:18:50] cwd like what? [16:19:25] ACKNOWLEDGEMENT - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2463 Casey Dentinger english banners cause replag [16:20:05] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2297 [16:20:15] ACKNOWLEDGEMENT - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2577 Casey Dentinger english banners cause replag [16:20:58] ACKNOWLEDGEMENT - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2297 Casey Dentinger english banners cause replag [16:21:46] ejegg: ah i guess it's just consume and thank [16:22:14] i don't know of a good solution to this unless we can speed that job up [16:22:21] i.e. make it less db intensive [16:23:48] the thing is bandaiding over this is not a good idea, if the db master was to go down right now we would be looking at an hour of permanent data loss [16:23:53] an extremely busy hour [16:24:09] the only safe way to handle it is to not cause replag [16:24:12] cwd but why are the new dbs having so much more trouble with the job than the old dbs? [16:24:32] i'm not sure what you mean? 
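cwd's suggestion above — disabling certain jobs while English banners are up — amounts to a lag-aware job guard. A minimal sketch in Python; the job names and threshold are assumptions for illustration, not the real process-control configuration:

```python
from typing import Optional

# Hypothetical: lag-sensitive jobs and the lag cutoff are made-up values,
# not the actual fundraising process-control settings.
HEAVY_JOBS = {"consume_and_thank", "dedupe"}
MAX_LAG_SECONDS = 300


def should_run_job(job_name: str, seconds_behind_master: Optional[int]) -> bool:
    """Skip heavy jobs when any replica is too far behind the master.

    seconds_behind_master is None when replication is broken entirely;
    treat that as 'too far behind' for heavy jobs.
    """
    if job_name not in HEAVY_JOBS:
        return True
    if seconds_behind_master is None:
        return False
    return seconds_behind_master <= MAX_LAG_SECONDS
```

The point of the guard matches the reasoning that follows: it trades job latency for a bounded replication window, so a master failure never loses an hour of unreplicated writes.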
[16:24:34] or did we always have the replag and just no alarms? [16:24:45] the dbs are constantly getting bigger and slower [16:24:49] and the jobs more complex [16:25:22] the jobs haven't gotten any more complex since last year's Big English campaign [16:25:32] we already had the new financial tables in for that one [16:26:00] and Civi upgrades since then haven't changed the inserts significantly [16:26:32] Is it that we upgraded the main server massively, so it can handle a lot faster inserts? [16:26:38] but the replicated ones can't? [16:26:40] well we can try to get the DBAs involved to determine the specific cause [16:27:02] that is possible [16:27:42] 1002 looks about as big as 1001 [16:28:03] but replication is different than the way the master runs queries [16:28:28] Can we get an idea for the number of writes they SHOULD be able to handle? [16:28:34] not sure what the unit would be for that [16:28:51] rows/sec of a certain size, with certain indexes, etc? [16:28:54] it really depends on what you are doing [16:29:36] we could look at switching to row based replication [16:29:39] yeah, we just have to isolate whether this is anything app-specific, or whether it can be fixed in db-level config [16:29:53] and since the dbs have changed a hell of a lot more than the app since the last big campaign [16:29:59] i'd first suspect them [16:30:22] what has changed so much about the dbs? [16:30:42] they're on totally new machines now, right? [16:30:52] yeah, if anything they would be faster [16:30:54] different network hardware, etc [16:31:03] right, as long as all the config is optimised right [16:31:09] but it has in fact changed [16:31:22] and we are in fact seeing replication lag issues [16:31:37] that weren't there before [16:32:05] when we switched over, did we do any kind of stress testing like a big english test?
[16:32:16] ejegg: cwd I think during big english we turned off the dedupe job for the first few days, but actually I think it was on when we had traffic similar to what we would have seen today [16:32:38] can we put something in the calendar for next week to talk about how we can stress test? [16:33:10] heh, i'd say these 1 hr tests are pretty good stressors [16:33:53] there is also the nature of the fundraising [16:33:59] we are showing more banners more often than ever [16:40:25] anyway i have spent a lot of time trying to narrow down the cause and i will spend a lot more, it is unfortunately very non-obvious what things make dbs slow [16:43:38] Fundraising-Backlog, Wikimedia-Fundraising-Banners: Safari issue with Other amount option in banner: mask currency code letters - https://phabricator.wikimedia.org/T173431#3528117 (MBeat33) [16:48:01] (CR) Mepps: "Hashar, thanks for the review! Please see one question in reply to your comment. Patch with phpcs.xml about to be posted." (1 comment) [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [16:48:24] (PS3) Mepps: WIP Add phpcs script but times out on test [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) [16:49:01] hey ejegg! [16:49:25] hi mepps! [16:49:30] how was the road trip? [16:49:32] eileen1 I was almost worried you were on this early but then I remembered you're not in new zealand [16:49:54] really great ejegg! except a couple nervous moments haha during encounters with law enforcement [16:50:04] but we made it safe and sound [16:50:06] oh jeez, border crossings are never fun [16:50:17] mepps: yep, still Montreal [16:50:27] yeah i'll tell the story in standup if folks are interested, it's just funny [16:50:27] I leave tomorrow [16:50:44] eileen, cool! 
yeah not too early in canada :) [16:51:19] I'll probably not do stand up - not really working today (although have looked at translation issues with bgm just now - mostly I played rubber duck) [16:51:19] ejegg been catching up on emails/finishing phpcs task and now trying to regroup back to ingenico/orphan adapter work [16:51:38] i actually pinged you to remember where we were in that work and as i chatted i remembered orphan adapter [16:51:40] eileen1: can you estimate how much the absolute size of the db has increased this year? [16:51:52] eileen1 got it! hope you're having fun up there [16:51:54] hmm - good question - [16:52:08] so the silverpop data is around 24 GB [16:52:20] wow [16:52:23] my guess is the rest of the data would have increased by more than that all up [16:53:01] one factor is that all the db boxes are sitting at close to max memory usage all the time [16:53:07] they each have 512GB ram [16:53:08] (silverpop data is many many rows but each row is v small - other data is say 2 mill contacts with each one having a row in contact, contribution, line item, etc) [16:53:22] hmm - so what affects DB ram usage? [16:53:44] I instantly thought about php loading too much in memory before processing but that is a different box I guess [16:53:44] well i am guessing performance goes to hell the moment you can't fit the whole db in ram [16:54:06] ah - but we can maybe push some tables in ram more than others? [16:54:13] so if we were hitting some sort of absolute size limit [16:54:17] ie. all the log_ tables are really simple writes [16:54:32] yeah, there must be some way to prioritize [16:54:39] (they never do an update or an insert with queries on other tables) [16:55:30] hello MBeat [16:55:35] there is also a bit of a slow select I think I saw sneak back ( a few seconds but needlessly & common, but I feel like that maybe wouldn't contribute?) [16:55:37] hi jessicarobell [16:56:13] We are ready to launch the Brazil campaign, ok for you that we enable?
also, heads up dstrine ejegg [16:56:23] out for a bit now - will say hello later [16:56:25] eileen1: yeah selects won't get replicated so it's probably fine [16:56:28] see ya! [16:56:32] totally, thanks for the heads-up [16:56:43] Cool! Thanks MBeat [16:56:54] All good from tech @dstrine [16:56:56] ? [16:58:01] jessicarobell can I have a link for one of the banners pls? :) [16:58:30] jessicarobell oh its Pats... not sure why my user is coming as guest?? [16:58:38] sure :) https://pt.wikipedia.org/wiki/Apple?banner=B1718_0816_ptBR_dsk_p1_lg_txt_cnt&force=1&country=BR [16:58:46] Hello Pats! [16:58:48] thanks :) [16:59:35] (PS4) Umherirrender: WIP Add phpcs script but times out on test [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [16:59:48] (CR) Umherirrender: "Fixed phpcs.xml" [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [17:00:16] @ejegg are you around or someone else from tech? :) [17:01:44] (CR) jerkins-bot: [V: -1] WIP Add phpcs script but times out on test [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [17:04:56] jessicarobell: hey... [17:05:01] hi jessicarobell ! [17:05:04] what's up? [17:05:56] Hey @ejegg & @AndyRussG just double checking that we are fine to launch the Brazil campaign? (AstroPay) [17:06:01] We are ready to push the button :) [17:07:19] Ah ejegg is definitely more able to answer that than I [17:08:50] Ok, I'll wait for ejegg :) [17:09:22] the db replag is still growing [17:10:19] ejegg: are the queues empty? [17:10:30] (CR) Umherirrender: "It is now running:" [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [17:12:35] jessicarobell: can you wait?
i am trying to resolve some problems with the database [17:13:35] jessicarobell: everything specific to the d*local integration looks OK to me, so as soon as cwd is happy with the dbs I'm fine launching [17:13:47] cwd looking at queues now [17:13:53] the replag is over 5000s now [17:14:00] (PS2) Ejegg: Position hosted checkout iframe [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/372157 (https://phabricator.wikimedia.org/T171346) [17:14:02] (CR) Mepps: "Umherirrender currently running composer fix. Thanks!" [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [17:14:19] oh jeez, that's a lot of seconds [17:15:41] the past behavior has been weird though, seconds behind gets huge but then it catches up suddenly [17:16:23] cwd last queue consumer run was just 1.5 seconds [17:16:28] asking the dbas, we poked at this for hours last time and didn't really get anywhere [17:16:32] so yeah, I guess that queue is emptied [17:16:38] ok sure @cwd & @ejegg [17:16:50] ty jessicarobell, sorry for the inconvenience [17:16:58] ringing @pcoombe for info if you are around [17:17:04] my best guess is: https://ganglia.wikimedia.org/latest/graph.php?h=frdb1002.frack.eqiad.wmnet&m=cpu_report&r=4hr&s=by+name&hc=4&mc=2&st=1502902078&g=mem_report&z=medium&c=Fundraising+eqiad&_=1502903293781 [17:17:21] i notice it has some free memory until the english banners [17:17:25] and has been pegged since [17:17:41] cwd is that the whole civi db cached in ram? [17:17:44] (PS5) Mepps: WIP Add phpcs script but times out on test [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) [17:18:19] not sure, but probably [17:18:26] whatever it is querying heavily [17:18:29] oh wait, green is 'cache' [17:18:42] it's deceptively named, it's pretty much the same thing and used [17:18:51] *as used [17:18:54] does that mean disk cache is 98GB?
[17:19:09] and the 400GB is application-level caching? [17:19:15] (CR) Mepps: "Please note WIP. I'm still reviewing changes from composer fix." (1 comment) [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [17:19:17] https://serverfault.com/questions/85470/meaning-of-the-buffers-cache-line-in-the-output-of-free [17:19:53] hmm [17:20:05] ok, looking at what else is running [17:20:07] it's lower level than application caching [17:20:17] has our mail loader caught up to now yet? [17:20:36] i think so but not positive [17:20:57] huh, well, the recipient load doesn't run at this time of day [17:21:14] and the last time it did run, it took 38 seconds [17:21:37] which is probably the amount of time it takes to make the setup api calls, then wait, then check the FTP site for new stuff [17:22:22] cwd: and ejegg is what you are discussing right now posing any risk to starting a campaign? I'm just trying to get context. [17:22:22] i am seeing some dedupe fail mails too [17:22:52] the 'mailing load' job is also super quick lately [17:22:53] dstrine: yes, the context is that our db copies are having trouble catching up to the master [17:23:08] if the master server was to go down, we would lose all the data they haven't replicated [17:23:16] ok thanks cwd [17:23:34] *why* they are not able to catch up is the million dollar question [17:23:39] fr-tech any news for scrum of scrums? [17:23:52] dun think so [17:24:55] should I mention we might want more dba help? [17:25:19] it's ok, they know [17:25:23] k [17:25:31] they are just super swamped [17:26:01] i'm going to restart mysql on frdev1001 and just see if that fixes it [17:27:01] Sorry, I keep forgetting, are we doing slow query logging all the time? [17:27:02] heh turn it off and on :P [17:27:16] hehe, yep [17:28:02] ah now that ccogdill is here... have you sent any emails for brazil yet? [17:28:08] tomorrow [17:28:29] ok cool.
Things are a little interesting today as you can see ^^^ [17:29:03] cwd and ejegg - will you please let spatton know when we have green light to launch Brazil and also Malaysia and South Africa Mobile banners? I have to jump off for a bit [17:29:18] I do see! [17:29:28] will watch in case they continue to be interesting tomorrow [17:29:40] jessicarobell: will do [17:29:51] wonderful! Thanks all! [17:30:32] well it's catching up now [17:30:35] frdev1001 [17:30:45] the others are still getting worse [17:30:50] huh, like way faster than 5000 sec? [17:30:58] yeah [17:31:11] should be caught up in a couple minutes [17:31:17] restarting mysql is the worst fix ever though [17:31:17] weird [17:32:38] this could be a straight up bug in replication [17:35:15] RECOVERY - check_mysql on frdev1001 is OK: Uptime: 381 Threads: 2 Questions: 1923633 Slow queries: 10 Opens: 270 Flush tables: 1 Open tables: 237 Queries per second avg: 5048.905 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [17:36:30] ejegg|mtg: nothing here, thx!! [17:41:54] XenoRyet did you ever get to look at: https://gerrit.wikimedia.org/r/#/c/370225/ [17:49:52] cwd all clear for the campaigns to start up? [17:50:54] the other 2 slaves are still getting worse [17:51:07] restart 'em both too? [17:51:15] it would be really good to figure out what's going on here and the only chance is when they are broken [17:51:23] We're not dumping any more data into civi as far as I can tell [17:52:15] ok, what can we do to investigate? [17:52:50] ejegg do you know if there's a convention around writing tests for drush scripts? 
[17:52:59] my belief at this point is that there is a memory leak in the slave thread [17:53:10] i am trying to find any related information [17:53:42] mepps I think best would be to make the drush-specific code as small as possible (just get options) [17:53:55] then put all the real logic into a function that can be tested [17:54:05] like in an actual class [17:54:15] there are tons of instances of this in mysql's past [17:54:26] ejegg cool, that makes sense and i think is close to what i'm doing [17:54:28] and with dependencies provided in constructor [17:54:31] mepps great! [17:54:43] but gotta write the tests :) [17:54:54] sorry, or whatever kind of dependency injection makes the most sense [17:55:07] mepps cool [17:55:42] so, you'll want to use the smashpig TestingConfiguration stuff [17:56:01] that'll set up a pending DB in a sqlite in-memory instance [17:56:25] then you can populate that with your testing data [17:57:53] if the rectifier is writing successfully rectified things to queues, the SmashPig TestingConfiguration will also set those up with similar backing stores [17:58:01] so you can test what gets sent [17:58:25] cwd so what can we do to look for memory leaks? [17:58:50] search the internet for bugs i guess [17:58:52] cwd can we dump a snapshot to examine offline? [17:58:56] is what i'm trying to do [17:59:04] snapshot of what? [17:59:15] the current state of the machine [17:59:44] so we can restart the real thing and start the campaign [18:01:23] i mean, no there's no way to do that [18:01:26] eh, i guess it's not a setup you can test fixes on without replicating the state of the whole network [18:01:30] restarting mysql is just kicking the can down the road [18:01:39] and we will hit this again soon [18:01:48] but if the banners are the priority i guess i'll do that [18:03:21] would it help to set up some more monitoring / logging before next wednesday's big english test?
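ejegg's testing advice above (keep the drush wrapper thin, put the real logic in a class, inject dependencies via the constructor) is a standard pattern. A sketch in Python for illustration only — the real code is PHP/drush, and every class and method name here is hypothetical:

```python
# Hypothetical illustration of the pattern discussed above: CLI-specific
# code stays trivial; testable logic lives in a class whose dependencies
# (pending DB, queue) are injected, so tests can swap in fakes.


class FakePendingDb:
    """Stand-in for the pending-transaction store; tests inject this
    instead of a real database connection."""

    def __init__(self, orphans):
        self._orphans = orphans

    def fetch_orphans(self):
        return list(self._orphans)


class OrphanRectifier:
    """All real logic lives here; the drush entry point would only
    parse options, build real dependencies, and call rectify()."""

    def __init__(self, pending_db, queue):
        self.pending_db = pending_db  # real DB in prod, FakePendingDb in tests
        self.queue = queue            # real queue in prod, a plain list in tests

    def rectify(self):
        rectified = 0
        for message in self.pending_db.fetch_orphans():
            self.queue.append(message)
            rectified += 1
        return rectified
```

In the SmashPig setup described above, the fakes would instead be the in-memory sqlite pending DB and queue backing stores that TestingConfiguration wires up.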
[18:03:41] hmmm [18:04:28] well, let me dump some stats anyway and i'll restart the daemons [18:04:51] i think the logging/monitoring is pretty thorough [18:05:15] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 698395 Threads: 1 Questions: 20345006 Slow queries: 3856 Opens: 12097 Flush tables: 1 Open tables: 596 Queries per second avg: 29.131 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [18:05:43] weird [18:05:46] i didn't touch that [18:07:20] huh [18:07:32] i'm going to make a quick store run while traffic's light [18:07:53] cwd would you mind emailing jessicarobell when you're feeling OK about the dbs? [18:08:05] sure [18:08:10] thanks! [18:08:13] np [18:10:22] (CR) Umherirrender: "Big change, hard to review" (25 comments) [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [18:11:17] mepps: I'm only half-here right now, but yea I took a look and it seems like a good direction to me. [18:11:45] It's still marked WIP, so I didn't give it the full code review, but it looks good. [18:11:48] XenoRyet, thanks! 
sounds good, yeah you have other things to focus on :) [18:22:00] hey cwd I'm actually here whenever you can give us the green light to enable our new campaigns [18:24:07] spatton: cool, should be shortly [18:33:53] ugh and of course we don't have codfw monitoring [18:34:27] https://ganglia.wikimedia.org/latest/?r=2hr&cs=&ce=&m=cpu_report&c=Fundraising+eqiad&h=frdev1001.frack.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=small&metric_group=NOGROUPS [18:34:34] sure looks like a memory leak to me [18:39:08] permalink https://ganglia.wikimedia.org/latest/graph.php?h=frdev1001.frack.eqiad.wmnet&m=cpu_report&r=custom&s=by%20name&hc=4&mc=2&cs=08%2F16%2F2017%2016%3A30&ce=08%2F16%2F2017%2018%3A30&st=1502908611&g=mem_report&z=medium&c=Fundraising%20eqiad [18:40:26] i'm not sure what to say at this point, restarting mysql whenever we are going to run banners is not a solution [18:51:55] * cwd relaxes to soothing crickets [19:04:28] (PS2) AndyRussG: [WIP, pls. don't merge] Clone campaign feature [extensions/CentralNotice] - https://gerrit.wikimedia.org/r/371125 [19:12:53] (PS6) Mepps: WIP Add phpcs script but times out on test [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) [19:13:58] (CR) Mepps: "Good call Umherrinder, will work on removing the array syntax switch to simplify this patch." (1 comment) [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [19:14:11] (CR) Ejegg: "you might want to look at what I ended up doing in the SmashPig phpcs patch. I added a ton of exceptions to the phpcs.xml file for starter" [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/371770 (https://phabricator.wikimedia.org/T170314) (owner: Mepps) [19:15:15] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 7029 [19:16:24] ah ha [19:16:44] ah ha?
[19:16:45] or...what's going on there? [19:16:55] that box recovered on its own, and now un-recovered [19:17:13] extremely un-recovered, huh? [19:17:33] yeah [19:17:56] ejegg: https://ganglia.wikimedia.org/latest/?r=2hr&cs=&ce=&m=cpu_report&c=Fundraising+eqiad&h=frdev1001.frack.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=small&metric_group=NOGROUPS [19:18:09] looks a lot like a memory leak or caching problem in mysql [19:20:15] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 4600 [19:20:41] what is shredding the db at this point?? [19:21:19] the saddest part about this is that the most interestingly behaving box (2001) doesn't have monitoring [19:25:15] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 703195 Threads: 2 Questions: 20682272 Slow queries: 3856 Opens: 12335 Flush tables: 1 Open tables: 596 Queries per second avg: 29.411 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [19:28:27] weirder still [19:28:40] jeez [19:29:04] could the 'seconds behind master' metric just be totally off? [19:31:43] or it just doesn't mean what it sounds like it means [19:32:09] ejegg: that ganglia graph shows when i restarted mysql [19:32:56] 2001 caught up all on its own [19:33:15] but then lagged again, i'm guessing whatever buffer is still full [19:33:23] but we don't have monitoring for codfw so who knows [19:41:27] cwd how do you feel about starting the Brazil campaign now? 
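The check_mysql SLOW_SLAVE alerts quoted throughout this log classify each replica by its thread state and Seconds_Behind_Master. A minimal sketch of that kind of thresholding; the warning/critical cutoffs are assumptions, since the real Icinga check's values don't appear in the log:

```python
from typing import Optional

# Assumed thresholds for illustration; the production check's values
# are not shown anywhere in this log.
WARN_SECONDS = 300
CRIT_SECONDS = 600


def classify_replica(slave_io: bool, slave_sql: bool,
                     seconds_behind: Optional[int]) -> str:
    """Map replication state to an Icinga-style status string."""
    if not (slave_io and slave_sql) or seconds_behind is None:
        # Replication threads stopped, or lag unknown: always critical.
        return "CRITICAL"
    if seconds_behind >= CRIT_SECONDS:
        return "CRITICAL"
    if seconds_behind >= WARN_SECONDS:
        return "WARNING"
    return "OK"
```

Note that both IO and SQL threads read "Yes" in every alert above: the replicas were healthy but falling behind, which is exactly why the lag number, not the thread state, carries the signal here.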
[19:42:26] well i feel like it's unwise to prioritize the banners over the stability of the system [19:43:11] https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=frdb1002.frack.eqiad.wmnet&m=cpu_report&r=year&s=by%20name&hc=4&mc=2&st=1502912172&g=mem_report&z=large&c=Fundraising%20eqiad [19:43:28] https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=frdb1001.frack.eqiad.wmnet&m=cpu_report&r=year&s=by%20name&hc=4&mc=2&st=1502912172&g=mem_report&z=large&c=Fundraising%20eqiad [19:43:38] ok so what happened in late june to cause all the db boxes to be out of memory since then [19:47:26] Fundraising Sprint Outie Inverter, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Patch-For-Review, Unplanned-Sprint-Work: Civi imports currently aren't working. - https://phabricator.wikimedia.org/T172918#3528863 (DStrine) Open>Resolved [19:47:54] Fundraising Sprint Lou Reed, Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising-Backlog, and 8 others: Update GC audit to read WX file - https://phabricator.wikimedia.org/T86090#3528867 (Ejegg) Open>Resolved [19:48:23] Fundraising Sprint Kickstopper, Fundraising Sprint Loose Lego Carpeting, Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, and 5 others: Resultswitchers: send straight to ty page on reload - https://phabricator.wikimedia.org/T167990#3528877 (Ejegg) Open>Resolved [19:48:25] Fundraising Sprint Kickstopper, Fundraising Sprint Loose Lego Carpeting, Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, and 4 others: PayPal EC dead session error - https://phabricator.wikimedia.org/T167923#3528878 (Ejegg) [19:49:10] Fundraising Sprint Baudelaire Bowdlerizer, Fundraising Sprint Costlier Alternative, Fundraising Sprint Deferential Equations, Fundraising Sprint English Cuisine, and 15 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3528888 (DStrine) @Pcoombe can yo... 
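One way to test the "working set no longer fits in RAM" theory raised above is the InnoDB buffer pool hit rate. A sketch using real MySQL/MariaDB status counter names (the numbers in the test are made up); a hit rate dropping well below ~0.99 during banner campaigns would support the theory:

```python
# Innodb_buffer_pool_read_requests counts logical page reads;
# Innodb_buffer_pool_reads counts the subset that had to go to disk.
# Both come from SHOW GLOBAL STATUS and arrive as strings.


def buffer_pool_hit_rate(status: dict) -> float:
    """Fraction of page reads served from the buffer pool rather than disk."""
    requests = int(status["Innodb_buffer_pool_read_requests"])
    disk_reads = int(status["Innodb_buffer_pool_reads"])
    if requests == 0:
        return 1.0  # no reads yet; treat as a perfect hit rate
    return 1.0 - disk_reads / requests
```

Sampling this before and during a Big English test would distinguish "the cache stopped fitting" from the memory-leak hypothesis discussed above.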
[19:49:11] Fundraising Sprint Outie Inverter, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: Allow use of Civimails - https://phabricator.wikimedia.org/T162747#3528890 (Ejegg) Open>Resolved [19:49:25] Fundraising Sprint Loose Lego Carpeting, Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, and 5 others: Implement Ingenico Connect API call to create hosted payment - https://phabricator.wikimedia.org/T163946#3528891 (Ejegg) Open>Reso... [19:50:15] Fundraising Sprint Judgement Suspenders, Fundraising Sprint Kickstopper, Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, and 5 others: retrieve lists of contacts who received a particular mailing - https://phabricator.wikimedia.org/T161762#3528921 (DStrine) @CCogdill_WMF I am... [19:50:23] Fundraising Sprint Judgement Suspenders, Fundraising Sprint Kickstopper, Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, and 5 others: retrieve lists of contacts who received a particular mailing - https://phabricator.wikimedia.org/T161762#3528922 (DStrine) Open>Resolv... [19:50:25] Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Epic, FR-Email: EPIC: Add key silverpop data into Civi - https://phabricator.wikimedia.org/T161757#3528923 (DStrine) [19:55:50] Fundraising-Backlog, FR-PayPal-ExpressCheckout: Paypal EC - send orphan-y donors to TY page - https://phabricator.wikimedia.org/T173456#3528947 (Ejegg) [19:59:52] just to throw something else out, and admitting that I'm not following all this TOO closely - we were planning a 1-hour Japanese pre-test tonight. Is that a problem? Should we hold it? CC ejegg and cwd [20:00:07] and we have MY & ZA banners up... 
[20:03:45] ah ha [20:03:49] Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, FR-Ingenico, FR-WMF-Audit: Ingenico WX audit parser: refunds missing gateway_txn_id - https://phabricator.wikimedia.org/T173457#3528972 (Ejegg) [20:06:15] Fundraising Sprint Baudelaire Bowdlerizer, Fundraising Sprint Costlier Alternative, Fundraising Sprint Deferential Equations, Fundraising Sprint English Cuisine, and 15 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3528988 (Pcoombe) Open>Res... [20:09:16] Fundraising-Backlog: dress rehearsal for MG event - https://phabricator.wikimedia.org/T173458#3528996 (DStrine) [20:10:32] ejegg, dstrine - are you all in a meeting? i'd be happy to pop in and talk about this [20:10:52] spatton: trying to come up with a reasonable solution [20:10:53] cwd yeah sprint planning [20:11:05] can you spare a few minutes? [20:27:46] Just want to reiterate a point pcoombe made in our FR slack channel, that we defer to your call on timing, cwd ; we can wait til tomorrow if that's the right course. [20:29:55] spatton: thanks! i'm going to execute a very hacky solution shortly but it should get us to a place where it's ok to drive traffic at payments [20:30:03] Fundraising Sprint p 2017, Fundraising-Backlog: dress rehearsal for MG event - https://phabricator.wikimedia.org/T173458#3529037 (DStrine) [20:30:05] Fundraising Sprint p 2017, Fundraising-Backlog, FR-PayPal-ExpressCheckout: Paypal EC - send orphan-y donors to TY page - https://phabricator.wikimedia.org/T173456#3529038 (DStrine) [20:30:06] sweet!
[20:30:07] Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog: MY users getting express checkout at some times - https://phabricator.wikimedia.org/T173334#3529039 (DStrine)
[20:30:09] Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, FR-Paypal, Patch-For-Review: PayPal audit parser shouldn't send duplicate recurring donations - https://phabricator.wikimedia.org/T172723#3529040 (DStrine)
[20:30:12] Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, and 2 others: Create orphan rectifier for PayPal Express Checkout - https://phabricator.wikimedia.org/T172202#3529042 (DStrine)
[20:30:15] Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, and 2 others: CentralNotice: JS selection widgets no longer work in interface to add a campaign - https://phabricator.wikimedia.org/T172023#3529043 (DStrine)
[20:30:17] Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, FR-Ingenico: Deal with recurring donations stuck in 'In Progress' status - https://phabricator.wikimedia.org/T171868#3529044 (DStrine)
[20:30:19] Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, and 2 others: Populate country column when creating c_t rows during offline import - https://phabricator.wikimedia.org/T171658#3529045 (DStrine)
[20:30:21] Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, Patch-For-Review: Strict error message in logs (quick tidy up) - https://phabricator.wikimedia.org/T171560#3529046 (DStrine)
[20:30:23] Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, and 3 others: Are PayPal refunds for recurring donations incorrectly being tagged as EC or vice versa? - https://phabricator.wikimedia.org/T171351#3529047 (DStrine)
[20:30:25] Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, and 2 others: Are we losing transactions witih repeated ct_id? - https://phabricator.wikimedia.org/T171349#3529048 (DStrine)
[20:30:27] Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, and 4 others: Optimize filesystem access in thank_you email job - https://phabricator.wikimedia.org/T170435#3529050 (DStrine)
[20:30:29] Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, and 2 others: Customize Ingenico hosted checkouts iframe - https://phabricator.wikimedia.org/T171346#3529049 (DStrine)
[20:30:32] Fundraising Sprint Kickstopper, Fundraising Sprint Loose Lego Carpeting, Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, and 4 others: process-control should not crash on bad utf-8 from stdout or stderr - https://phabricator.wikimedia.org/T167849#3529053 (DStrine)
[20:30:34] Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, MediaWiki-extensions-DonationInterface: Update Mastercard logo - https://phabricator.wikimedia.org/T166795#3529054 (DStrine)
[20:30:37] Fundraising Sprint Loose Lego Carpeting, Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, and 4 others: CentralNotice: Add controls to purge banner content in Varnish for a specific language - https://phabricator.wikimedia.org/T168673#3529052 (...
[20:30:39] Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, and 5 others: Implement Ingenico Connect API calls to get payment status - https://phabricator.wikimedia.org/T163948#3529055 (DStrine)
[20:30:41] Fundraising Sprint p 2017, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, FR-2016-17-Q2-Campaign-Support: Periodically run Civi contact import performance tests, note trends - https://phabricator.wikimedia.org/T146338#3529058 (DStrine)
[20:30:44] Fundraising Sprint Ivory Tower Defense Games, Fundraising Sprint Judgement Suspenders, Fundraising Sprint Kickstopper, Fundraising Sprint Loose Lego Carpeting, and 9 others: Move already-encapsulated useful classes into SmashPig - https://phabricator.wikimedia.org/T163868#3529056 (DStrine)
[20:30:46] Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, and 3 others: Import email-only contacts from 'remind me later' links into CiviCRM - https://phabricator.wikimedia.org/T160949#3529057 (DStrine)
[20:30:48] Fundraising Sprint p 2017, Fundraising-Analysis, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: Probably cause of failmail deadlocks - soft credit search pain - https://phabricator.wikimedia.org/T130068#3529059 (DStrine)
[20:30:51] Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, Fundraising-Backlog, and 5 others: Merge CRM and DonationInterface queue wrappers - https://phabricator.wikimedia.org/T95647#3529060 (DStrine)
[20:30:54] Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Fundraising Sprint Outie Inverter, Fundraising Sprint p 2017, and 3 others: Clone button for CN campaigns - https://phabricator.wikimedia.org/T91078#3529061 (DStrine)
[20:51:08] pcoombe: spatton we should be good to go on banners
[20:51:25] I'll email this too just to be sure someone sees it.
[20:53:33] email sent
[20:59:56] thanks dstrine
[21:03:11] thanks much dstrine! thanks for the hard work cwd.
[21:07:20] heh, the hard work is yet to come :)
[21:07:32] :D
[21:33:20] BRL card transactions reaching Civi, TY emails going out, ahh
[21:52:44] ejegg: so seconds_behind_master just means how long since it's been in sync
[21:52:53] nothing to do with how far out of sync it is
[21:53:14] 2001 is lagging now
[21:53:24] we should see alerts shortly
[21:53:53] i do not, however, understand why seconds behind master goes down while it catches up
[21:54:28] cwd wait, that seems like the numbers would be more predictable if that was the interpretation
[21:55:10] oh right, the increase is indeed predictable
[21:56:55] 2001 caught up again
[21:57:02] frdev1001 never lagged
[21:58:04] and for the first time today it's going *down* on 1002
[21:58:28] Seconds_Behind_Master: 15642
[21:58:50] hmmm i just thought of something
[21:58:55] 1002 is the read server
[21:59:06] it could be locking excessively due to reads
[22:08:16] so we put banners up and now things are starting to catch up
[22:08:18] what the hell
[22:09:42] maybe it knocked the clog loose!!!
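[Editor's note: the confusion above about why Seconds_Behind_Master moves non-monotonically is common. Per the MySQL documentation, the value is the difference between the replica SQL thread's clock and the timestamp of the event it is currently applying, so it can jump as the thread works through old events. A minimal sketch of how one might parse the value from check output and gauge whether a replica is gaining ground; `parse_lag` and `catch_up_rate` are hypothetical helper names, and the sample figures are taken from the alerts in this log.]

```python
import re

def parse_lag(status_text):
    """Pull the Seconds_Behind_Master value out of SHOW SLAVE STATUS
    or Icinga check_mysql output (both spellings appear in the wild)."""
    m = re.search(r"Seconds[_ ]Behind[_ ]Master:\s*(\d+)", status_text)
    return int(m.group(1)) if m else None

def catch_up_rate(lag_then, lag_now, interval_s):
    """Seconds of lag worked off per wall-clock second between two samples.
    > 0 means the replica is gaining ground; -1 means the SQL thread is
    effectively stalled (lag grows one second per second)."""
    return (lag_then - lag_now) / interval_s

# The two frdb1002 alerts five minutes apart at the end of this log:
print(parse_lag("Seconds Behind Master: 1306"))      # 1306
print(round(catch_up_rate(1306, 1424, 300), 2))      # -0.39: falling behind,
                                                     # but not fully stalled
```

Sampling twice like this distinguishes a stalled SQL thread from one that is merely slower than the write load, which the raw lag number alone cannot do.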
[22:10:01] >:|
[22:14:04] hah, weird
[22:17:19] https://media.giphy.com/media/DJSLqKN3HQCek/giphy.gif
[22:17:55] the best part is i didn't even
[22:17:59] i didn't do anything
[22:18:36] except observe it
[22:18:49] now that's what i call quantum computing volume -INF
[22:20:03] scratch that, the lag is growing again
[22:20:14] PROBLEM - check_procs on frdb1002 is CRITICAL: PROCS CRITICAL: 1201 processes
[22:20:28] well that's a new one
[22:25:14] RECOVERY - check_procs on frdb1002 is OK: PROCS OK: 203 processes
[22:27:04] https://media.tenor.co/images/c942a1c40d796ca485960a2f12766acf/tenor.gif
[22:27:20] haha, yes
[22:30:14] RECOVERY - check_mysql on frdb1002 is OK: Uptime: 1419622 Threads: 2 Questions: 40706317 Slow queries: 8498 Opens: 24954 Flush tables: 1 Open tables: 606 Queries per second avg: 28.674 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 176
[22:31:00] first time that thing has caught up all day
[22:31:16] i saw a couple of dedupe fails just now with:
[22:31:19] DB Error: unknown error
[23:20:57] Fundraising-Backlog, fundraising-tech-ops: Frack replag master thread - https://phabricator.wikimedia.org/T173472#3529511 (cwdent)
[23:21:32] ok i'm going away-ish but will be checking in
[23:50:14] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1306
[23:51:33] oh good
[23:53:51] i am restarting mysql on that box
[23:54:00] * cwd is alert-fatigued
[23:55:14] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1424
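[Editor's note: the check_procs flap above follows the standard Nagios-style threshold convention (OK below warn, WARNING at or above warn, CRITICAL at or above crit). A minimal sketch of that comparison; the warn/crit values actually configured on frdb1002 are not shown in the log, so 250 and 1000 here are invented for illustration.]

```python
def check_procs(count, warn=250, crit=1000):
    """Classify a process count the way a Nagios-style plugin would.
    Thresholds here are placeholders, not the real frack config."""
    if count >= crit:
        return ("CRITICAL", f"PROCS CRITICAL: {count} processes")
    if count >= warn:
        return ("WARNING", f"PROCS WARNING: {count} processes")
    return ("OK", f"PROCS OK: {count} processes")

# The two frdb1002 readings from this log:
print(check_procs(1201))  # ('CRITICAL', 'PROCS CRITICAL: 1201 processes')
print(check_procs(203))   # ('OK', 'PROCS OK: 203 processes')
```

A jump from ~200 to ~1200 processes on a database host usually means connections piling up behind slow or lock-bound queries, consistent with the read-lock theory discussed earlier.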