[00:12:59] (PS1) Ejegg: Fix CurrencyRates class in exchange_rates module [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/376450 [00:34:21] eileen: back [00:34:35] so what do you have in mind? [00:35:24] cwd - so there is a drush job that will insert a tonne of contacts [00:36:00] so I guess we should try to set up whatever monitoring we can [00:36:09] & the try running it & see if it triggers any upsets [00:36:17] & if that fails try the dedupe [00:36:27] & if that fails try both at once [00:36:43] & if that fails try some rerunning of mailing data retrieval [00:36:44] yeah sounds good [00:36:56] let me look at what prometheus metrics are available for a minute [00:37:13] we need to make sure jeff doesn't get a barrage of texts [00:37:16] also there is that pt-stalk thing but i believe we can fire that off when the load starts [00:37:42] heh well we can ack the alerts but i think he knows not to worry about it tonight [00:39:16] ok so I guess I'll get logged in & when you are ready we can try kick it off & we should log here when we do [00:40:15] yes good thinking [00:40:57] eileen: try this on frdev: [00:41:05] curl localhost:9004/metrics [00:41:16] it should barf a million mysql statistics [00:41:31] i don't know if there's a better way to parse what it provides [00:41:37] but this is the data that is now feeding grafana [00:43:46] hmm yeah - I see it - not sure what to make of it [00:44:09] this is the drush command I'm going to issue when you are ready [00:44:10] drush cvapi Omnigroupmember.load group_id=310 mail_provider=Silverpop group_identifier=18468760 [00:44:29] yeah i say go for it [00:46:07] !log started Omnigroupmember.load [00:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:57] note there is some throttling etc in there - we might be able to tone it down if replag does not ensue - but we need to give it a chance to break stuff first [00:47:58] so cwd what indicators are you looking at to see if 
there is delay [00:48:52] eileen: well [00:48:58] !7971 contacts [00:49:06] !log 7971 contacts so far [00:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:49:19] (logging to catch how aggressive the creating is) [00:49:38] !log 8389 contacts now [00:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:50:22] there is the "seconds behind master" number [00:50:29] that is graphed now [00:50:46] and there is the dirty pages number, but like i said earlier i think it's normal for that to go up under load [00:51:21] but as soon as "seconds behind master" is > 0 it means the replicas can't keep up [00:52:53] ok - any idea how long it has taken for things not to keep up in the past? [00:54:22] i have not observed a pattern [00:54:36] well that's not quite true [00:54:55] earlier today "seconds behind master" on 1002 started climbing pretty much right away [00:55:40] so when that happens i just start asking mysql questions, ps and top, whatever [00:56:06] & where are you looking to monitor seconds behind - I think it's still 0? [00:56:47] eileen: https://grafana.wikimedia.org/dashboard/db/frack-db?orgId=1 [00:56:56] easy overview [00:57:08] otherwise you can do [00:57:13] show slave status \G [00:57:20] but you might need perms for that [00:57:31] ok - so I feel like I'm seeing seconds behind climbing now [00:59:08] yeah, 1002 appears to be starting to lag [00:59:40] i'm going to try pt-stalk [00:59:49] but easing up again :-( [01:00:08] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate-kafkatee.pyconf] [01:00:24] wat [01:00:30] !llog 17213 contacts now [01:00:38] the puppet?
[01:04:01] yeah [01:04:10] unrelated i'm sure [01:04:13] cwd I feel like we can say this method DOES replicate the replag [01:05:08] RECOVERY - check_puppetrun on americium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [01:05:50] agreed [01:06:00] !replag delay seconds behind master 237 [01:06:33] ok - it's been 30 mins now - should I kill it, let it catch up & retry [01:06:46] & see if within 30 mins we get replag again [01:06:52] if so I think we have a baseline [01:07:03] no it's been 20 mins [01:07:44] so i am not seeing any of the same queries as before in the pt-stalk files [01:08:36] but as before 1002 is the only one falling behind [01:09:02] of course 2001 is not graphed, i don't know why, i put it in the config but something to do with being in the other dc [01:09:19] do you want me to leave it going a bit - or should we restart to confirm we have a replicable way of testing? [01:09:45] hmmm [01:10:25] I can leave it going as long as you want - we just need to track how long & then retry IMHO to confirm we can do it 'on demand' [01:10:52] although the second time won't be exactly the same - there might be a slower start because it will try to reprocess some of the same ones :-( [01:11:12] I see Jeff is on ssh now [01:11:51] eileen: how big of a data backlog do you think we have? will it catch up and stop causing lag? [01:12:03] it will never catch up [01:12:10] nice [01:12:17] this job would, if allowed, bring an extra million contacts into civi [01:12:23] yeah then will you try killing it? 
[01:12:25] (which Caitlin says is a good thing [01:12:27] let's see if it catches up [01:12:35] heh :) [01:12:38] !log job killed [01:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:13:47] !log replag delay 422 [01:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:14:52] cwd - I added details about this test to 'what we know' section of ether pad [01:15:04] ty [01:15:17] glad (sarcastic) to see that grafana breaks the refresh button [01:15:47] also worth noting that this test is harder on the group_contact_cache than some of our jobs & we could try hacking those lines out as a test [01:16:17] eileen: would donation traffic also hammer that table? [01:16:44] I can't think of a way in which it would tbh [01:16:54] but we might be able to rule it out [01:18:06] i am still not seeing any of the same queries as before [01:18:10] but the same lag [01:18:33] beginning to think it is a more systemic problem [01:19:24] well that's helpful even if it doesn't feel like it [01:19:50] (PS1) Eileen: Temporarily hack out cache clearing, for test purposes only [wikimedia/fundraising/crm/civicrm] - https://gerrit.wikimedia.org/r/376456 [01:19:54] oh yeah [01:20:04] any data is good at this point [01:20:34] so replag is still climbing on 1002 [01:20:53] so this is what I was thinking we could try to test group_contact_cache role https://gerrit.wikimedia.org/r/#/c/376456/ [01:21:14] but I think we should run once more for baseline purposes before changing things [01:21:25] Jeff_Green: added some details to etherpad about this test [01:22:00] !log replag 550 sec behind master [01:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:24:43] (PS2) Eileen: Temporarily hack out cache clearing, for test purposes only [wikimedia/fundraising/crm/civicrm] - https://gerrit.wikimedia.org/r/376456 [01:29:16] !log replag 0 [01:29:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:47:36] !log replag 23 seconds [01:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:56:57] !log replag 197 seconds [01:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:00:08] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate-kafkatee.pyconf] [02:00:22] interesting [02:00:31] I'll kill in 5 - that will be roughly the same as before [02:04:35] what the shit is that about [02:04:52] ? [02:04:56] puppet? [02:04:58] the puppet fail again [02:05:05] kill the puppet [02:05:05] same one, around the same time relative to lag [02:05:08] RECOVERY - check_puppetrun on americium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [02:05:11] oh i wish i could [02:05:25] cwd ready for me to kill the job? [02:05:52] !log replag 217 behind [02:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:06:13] eileen: yep go for it [02:07:37] oh dang - I had not noticed a broken pipe :-) [02:07:52] it should be in its death throes [02:11:25] so cwd I kind of think we achieved our main goal on this today - we could test this https://gerrit.wikimedia.org/r/#/c/376456/ [02:11:34] to see what it tells us about the role of caches [02:13:27] eileen: happy to throw a meaningless +2 at it :) [02:13:41] and we'll just see if it lags at about the same speed?
[02:13:43] cwd - ok we'll deploy & test & revert [02:13:50] bueno [02:14:06] (CR) Cdentinger: [C: 2] Temporarily hack out cache clearing, for test purposes only [wikimedia/fundraising/crm/civicrm] - https://gerrit.wikimedia.org/r/376456 (owner: Eileen) [02:18:59] (PS1) Eileen: Submodule commit [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/376457 [02:19:11] (CR) Eileen: [C: 2] Submodule commit [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/376457 (owner: Eileen) [02:20:06] (Merged) jenkins-bot: Temporarily hack out cache clearing, for test purposes only [wikimedia/fundraising/crm/civicrm] - https://gerrit.wikimedia.org/r/376456 (owner: Eileen) [02:25:41] (Merged) jenkins-bot: Submodule commit [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/376457 (owner: Eileen) [02:33:54] (PS1) Eileen: Merge branch 'master' of https://gerrit.wikimedia.org/r/wikimedia/fundraising/crm into deployment [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/376459 [02:34:18] (CR) Eileen: [C: 2] Merge branch 'master' of https://gerrit.wikimedia.org/r/wikimedia/fundraising/crm into deployment [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/376459 (owner: Eileen) [02:34:32] once that is merged I can deploy & retry [02:35:26] (Merged) jenkins-bot: Merge branch 'master' of https://gerrit.wikimedia.org/r/wikimedia/fundraising/crm into deployment [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/376459 (owner: Eileen) [02:38:38] eileen: ready when you are!
[02:40:28] !log update civicrm from 764bfe1f92cc8bbf8d7536744fb47b34ca7a4365 to 73a24b4ec193586c597eb16da4c0494a640e36fd [02:40:37] cwd v close [02:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:42:24] still rsyncing [02:43:24] done - hitting it [02:43:52] !log restarting groupmember.load after hacking out cache delete calls [02:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:46:35] eileen: just so i understand, this is a table (tables?) used for caching searches? [02:47:00] and you commented out the deletes, which would cause cache refreshes? [02:47:07] will your queries be hitting stale data? [02:48:12] cwd it's used for caching group_memberships [02:48:39] & yeah if this turned out to make a difference there are some things to look at [02:48:53] the hacky fix would still need to be reverted [02:52:45] dirty pages just started going up [02:53:40] Here is my stand up for next time - "I spent most of the day refreshing server graphs" [02:55:02] haha [02:55:08] sounds like many of my days [02:56:27] dirty pages going up but no real replag so far [02:57:34] yeah it's looking like it might be a real difference! [02:58:16] that would be unexpected [02:58:49] eileen: probably a long shot but can you think of anything that might have changed the performance of that caching in late july? [02:58:53] given the total time elapsed is a poor indicator due to the lead in - I think the time between getting dirty & starting replag is probably a good indicator [02:58:55] pretty sure that's the first time we saw this [02:59:30] with caching data as well as code can change it - ie. creating more groups [02:59:41] also, could read queries cause the cache to refresh? 
[02:59:47] sure yes that makes sense [02:59:58] but I'm also thinking about the timing of the civi upgrades [03:00:03] also i guess even if reads didn't cause a refresh they could be locking the tables [03:00:10] aah that is a good thought [03:00:10] because a code change could have happened [03:02:45] cwd there are a few options with cache clearing - one is to do it by cron - e.g. every 5 mins [03:03:02] rather than accept the existing churn [03:03:29] anyway if we don't see replag in the next 10 mins I'm happy to accept that as a 'strong line of enquiry' [03:03:46] & we can revert & do some more specific tests next week [03:04:17] maybe leave for 15 or 20 just to be sure [03:05:03] wow i am really surprised to see this [03:05:14] eileen: are you watching how many contacts inserted? [03:07:44] hm - I haven't this time - total is 53182 [03:08:10] I'm conscious after the first round the way I was checking that was putting load on caches potentially [03:08:16] (since I was doing via UI) [03:08:39] hmm replag started to creep up [03:08:52] !log seconds behind 55 [03:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:09:30] rats [03:10:20] I guess I should have been checking whether it was just doing fine because they were all already in there! [03:11:22] eileen: although it seems like dirty pages would have stayed low too? [03:11:41] true - dirty is just for write? [03:12:24] hmm, i think so but i am not positive [03:13:14] it doesn't seem like reads would require writing to disk at all [03:13:38] !log seconds behind=78 [03:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:13:50] the trajectory increase seems slow [03:14:01] yeah [03:14:12] i think <1000 won't alert at all [03:14:21] so this amount of lag could be happening all the time [03:15:45] cwd it's fallen again! [03:16:31] eileen: it's still running?
[03:16:57] no - just looked - DB Error::unknown error [03:17:07] haha, damn it [03:17:23] i notice dirty pages just flat lined too [03:17:26] :-( [03:17:43] quick catch up [03:17:43] so i have seen that error in several fail mails over the last few months [03:17:55] i wonder where i can look to get more info [03:18:04] yep - mysql bug - probably an overlong email in this case at a guess [03:18:29] ok so here is my thinking - that test was not conclusive but indicated it's worth doing more digging on [03:19:02] agreed, changing the caching seems to have changed the behavior [03:19:03] but we probably want to get some different csv handling so we don't have to try to guess-compare start times [03:19:18] (because we can't rule that out) [03:19:23] yep [03:19:26] and the next email in the csv is a problem [03:19:33] try to skip the existing ones without expensive selects? [03:19:40] so I think the next step is I do some work on those 2 issues [03:19:54] It might just be a process of post-cleaning the csv - not sure [03:20:11] is the nature of the problem email obvious? [03:20:19] too long, weird chars, etc? [03:20:21] I'll have to find it [03:20:27] ah cool [03:20:31] but I'm not thinking to do it now [03:20:45] am thinking to revert change & send email as to where we got to now [03:20:56] for sure [03:20:57] no hurry [03:21:03] this is the best progress we've made so far [03:21:36] of course i have thought that a few times :) [03:22:13] :-) [03:22:19] yep [03:22:31] I have to take kids out soon & will be late there [03:22:52] & I need to not forget the revert!
[03:23:21] yeah i'll be signing off soon, but i could do the revert if you're in a hurry [03:23:29] (by perusing your bash history) [03:23:35] (PS1) Eileen: Revert "Temporarily hack out cache clearing, for test purposes only" [wikimedia/fundraising/crm/civicrm] - https://gerrit.wikimedia.org/r/376461 [03:23:53] can you +2 it ^^ [03:24:29] no problem, just need to wait for gerritbot [03:24:48] iirc if i do it out of order it won't submit [03:25:55] (PS5) Eileen: Remove void data from city field. [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/375948 (https://phabricator.wikimedia.org/T174985) [03:25:57] (PS12) Eileen: Strip double whitespace & html ampersands from first_name, last_name fields [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/376169 (https://phabricator.wikimedia.org/T175107) [03:25:59] (PS1) Eileen: Submodule commit [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/376462 [03:26:04] it will… eventually [03:28:23] go zuul go [03:30:17] come on!! [03:33:58] (CR) Cdentinger: [C: 2] Revert "Temporarily hack out cache clearing, for test purposes only" [wikimedia/fundraising/crm/civicrm] - https://gerrit.wikimedia.org/r/376461 (owner: Eileen) [03:43:18] cwd I have to relocate - I'll try again to deploy that after kids dentist trip - merge is sitting here once that merges https://gerrit.wikimedia.org/r/#/c/376462/ [03:43:19] Fundraising Sprint Quill Pencil, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, and 2 others: WMDE banners failing to save - Timing out on save - https://phabricator.wikimedia.org/T170591#3587219 (AndyRussG) Yes, agreed that UBN isn't right at this point...
[03:43:39] Fundraising Sprint Quill Pencil, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, and 2 others: WMDE banners failing to save - Timing out on save - https://phabricator.wikimedia.org/T170591#3587220 (AndyRussG) p:Unbreak!>High [03:46:20] (Merged) jenkins-bot: Revert "Temporarily hack out cache clearing, for test purposes only" [wikimedia/fundraising/crm/civicrm] - https://gerrit.wikimedia.org/r/376461 (owner: Eileen) [03:59:02] (CR) Cdentinger: [C: 2] Submodule commit [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/376462 (owner: Eileen) [05:45:14] (PS2) Eileen: Submodule commit [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/376462 [05:57:37] (PS1) Eileen: Merge branch 'master' of https://gerrit.wikimedia.org/r/wikimedia/fundraising/crm into deployment [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/376474 [06:04:16] (CR) Eileen: [C: 2] Merge branch 'master' of https://gerrit.wikimedia.org/r/wikimedia/fundraising/crm into deployment [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/376474 (owner: Eileen) [06:05:01] (Merged) jenkins-bot: Merge branch 'master' of https://gerrit.wikimedia.org/r/wikimedia/fundraising/crm into deployment [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/376474 (owner: Eileen) [06:07:02] !log civicrm updated from 73a24b4ec193586c597eb16da4c0494a640e36fd to c1ece1e0d97e6a4ea397bdb6c04f175bded0f4c4 [06:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:05] Fundraising Sprint Quill Pencil, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, and 2 others: WMDE banners failing to save - Timing out on save - https://phabricator.wikimedia.org/T170591#3587309 (Nemo_bis) >>! In T170591#3501942, @AndyRussG wrote: > Hi..... 
[06:25:39] fundraising-tech-ops, Performance-Team, Wikimedia-General-or-Unknown, Performance: 2017 USA banners take 1-2 seconds to load - https://phabricator.wikimedia.org/T174267#3587338 (Nemo_bis) Thanks Gilles for the analysis. I understand that the URL parameters don't give a perfectly accurate view, bu... [06:32:28] fundraising-tech-ops, Performance-Team, Wikimedia-General-or-Unknown, Performance: 2017 USA banners may freeze your browser for 1-2 seconds - https://phabricator.wikimedia.org/T174267#3587355 (Nemo_bis) [06:52:54] fundraising-tech-ops, Performance-Team, Wikimedia-General-or-Unknown, Performance: 2017 USA banners may freeze your browser for 1-2 seconds - https://phabricator.wikimedia.org/T174267#3587383 (Gilles) There's a difference between actually freezing your browser (the page can't physically be intera... [07:00:08] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate-kafkatee.pyconf] [07:05:08] RECOVERY - check_puppetrun on americium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [07:40:08] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Exec[generate-kafkatee.pyconf] [07:45:08] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate-kafkatee.pyconf] [07:50:08] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 11 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[generate-kafkatee.pyconf] [07:55:09] RECOVERY - check_puppetrun on americium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:55:09] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1931 [08:00:18] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 497700 Threads: 1 Questions: 12899492 Slow queries: 2419 Opens: 6109 Flush tables: 1 Open tables: 604 Queries per second avg: 25.918 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1375 [08:15:08] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate-kafkatee.pyconf] [08:20:08] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate-kafkatee.pyconf] [08:25:08] RECOVERY - check_puppetrun on americium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:30:18] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate-kafkatee.pyconf] [09:35:08] RECOVERY - check_puppetrun on americium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:00:08] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate-kafkatee.pyconf] [11:05:08] RECOVERY - check_puppetrun on americium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:15:09] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[generate-kafkatee.pyconf] [11:20:09] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 11 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate-kafkatee.pyconf] [11:25:09] RECOVERY - check_puppetrun on americium is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [12:45:10] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate-kafkatee.pyconf] [12:50:10] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 11 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate-kafkatee.pyconf] [12:55:10] RECOVERY - check_puppetrun on americium is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [13:12:20] Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Wikimedia-CentralNotice-Administration: CentralNotice Banners suppressed by Translation Extension - https://phabricator.wikimedia.org/T175261#3588076 (Jseddon) [13:13:54] Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Wikimedia-CentralNotice-Administration: CentralNotice Banners suppressed by Translation Extension on wiki pages - https://phabricator.wikimedia.org/T175261#3588089 (Jseddon) p:Triage>High [15:35:20] Fundraising-Backlog, MediaWiki-ResourceLoader, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, Wikimedia-CentralNotice-Administration: ResourceLoader suppressed by Translation Extension on Meta wikipages - https://phabricator.wikimedia.org/T175261#3588715 (Jseddon) [15:37:19] Fundraising-Backlog, MediaWiki-ResourceLoader, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, Wikimedia-CentralNotice-Administration: ResourceLoader suppressed by Translation Extension on Meta wikipages - https://phabricator.wikimedia.org/T175261#3588720 
(gpaumier) [15:42:49] Fundraising-Backlog, MediaWiki-ResourceLoader, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, Wikimedia-CentralNotice-Administration: ResourceLoader suppressed by Translation Extension on Meta wikipages - https://phabricator.wikimedia.org/T175261#3588728 (Jseddon) https:/... [15:44:04] Fundraising-Backlog, MediaWiki-ResourceLoader, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, Wikimedia-CentralNotice-Administration: ResourceLoader suppressed by Translation Extension on Meta wikipages - https://phabricator.wikimedia.org/T175261#3588731 (Jseddon) p:Hi... [15:46:33] Fundraising-Backlog, MediaWiki-ResourceLoader, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, Wikimedia-CentralNotice-Administration: ResourceLoader suppressed by Translation Extension on Meta wikipages - https://phabricator.wikimedia.org/T175261#3588076 (Jdforrester-WMF)... [15:50:52] Fundraising-Backlog, MediaWiki-ResourceLoader, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, Wikimedia-CentralNotice-Administration: ResourceLoader suppressed by Translation Extension on Meta wikipages - https://phabricator.wikimedia.org/T175261#3588752 (Jseddon) This a... [15:53:37] (PS6) Ejegg: Remove void data from city field. [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/375948 (https://phabricator.wikimedia.org/T174985) (owner: Eileen) [15:53:47] (CR) Ejegg: [C: 2] Remove void data from city field. 
[wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/375948 (https://phabricator.wikimedia.org/T174985) (owner: Eileen) [15:53:55] (PS13) Ejegg: Strip double whitespace & html ampersands from first_name, last_name fields [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/376169 (https://phabricator.wikimedia.org/T175107) (owner: Eileen) [15:54:02] (CR) Ejegg: [C: 2] Strip double whitespace & html ampersands from first_name, last_name fields [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/376169 (https://phabricator.wikimedia.org/T175107) (owner: Eileen) [15:58:28] Fundraising-Backlog, MediaWiki-ResourceLoader, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, Wikimedia-CentralNotice-Administration: ResourceLoader suppressed by Translation Extension on Meta wikipages - https://phabricator.wikimedia.org/T175261#3588767 (Jdforrester-WMF)... [15:58:35] (CR) Ejegg: [C: 2] [merged upstream] CRM-19612 Dedupe: dodge problems introduced by query union [wikimedia/fundraising/crm/civicrm] - https://gerrit.wikimedia.org/r/373176 (https://phabricator.wikimedia.org/T160571) (owner: Eileen) [15:58:36] Fundraising-Backlog, MediaWiki-ResourceLoader, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, Wikimedia-CentralNotice-Administration: ResourceLoader suppressed by the Translate extension on Meta-Wiki pages - https://phabricator.wikimedia.org/T175261#3588768 (Nemo_bis) [15:59:53] Fundraising-Backlog, MediaWiki-ResourceLoader, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, Wikimedia-CentralNotice-Administration: ResourceLoader suppressed by the Translate extension on Meta-Wiki pages - https://phabricator.wikimedia.org/T175261#3588076 (Nemo_bis) Jav... 
[16:00:54] Fundraising-Backlog, MediaWiki-ResourceLoader, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, Wikimedia-CentralNotice-Administration: ResourceLoader suppressed by the Translate extension on Meta-Wiki pages - https://phabricator.wikimedia.org/T175261#3588780 (Jdforrester-WM... [16:05:13] (Merged) jenkins-bot: Remove void data from city field. [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/375948 (https://phabricator.wikimedia.org/T174985) (owner: Eileen) [16:07:40] Fundraising-Backlog, MediaWiki-ResourceLoader, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, Wikimedia-CentralNotice-Administration: ResourceLoader suppressed by the Translate extension on Meta-Wiki pages - https://phabricator.wikimedia.org/T175261#3588793 (Nemo_bis) [16:07:43] (PS9) Ejegg: Capture Adyen payments missing pending messages [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/319489 (https://phabricator.wikimedia.org/T149861) [16:08:35] (CR) jerkins-bot: [V: -1] Capture Adyen payments missing pending messages [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/319489 (https://phabricator.wikimedia.org/T149861) (owner: Ejegg) [16:08:38] Fundraising-Backlog, MediaWiki-ResourceLoader, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, Wikimedia-CentralNotice-Administration: ResourceLoader suppressed by the Translate extension on Meta-Wiki pages - https://phabricator.wikimedia.org/T175261#3588076 (Nikerabbit) N... 
[16:11:02] (Merged) jenkins-bot: Strip double whitespace & html ampersands from first_name, last_name fields [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/376169 (https://phabricator.wikimedia.org/T175107) (owner: Eileen) [16:11:19] Fundraising-Backlog, MediaWiki-ResourceLoader, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, Wikimedia-CentralNotice-Administration: ResourceLoader suppressed by the Translate extension on Meta-Wiki pages - https://phabricator.wikimedia.org/T175261#3588815 (Nikerabbit) j... [16:12:29] Fundraising-Backlog, MediaWiki-ResourceLoader, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, Wikimedia-CentralNotice-Administration: ResourceLoader suppressed by the Translate extension on Meta-Wiki pages - https://phabricator.wikimedia.org/T175261#3588817 (Nikerabbit) I... [16:14:14] Fundraising-Backlog, MediaWiki-ResourceLoader, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, Wikimedia-CentralNotice-Administration: ResourceLoader suppressed by the Translate extension on Meta-Wiki pages - https://phabricator.wikimedia.org/T175261#3588830 (Nikerabbit) `... [16:15:17] Fundraising-Backlog, MediaWiki-ResourceLoader, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, Wikimedia-CentralNotice-Administration: ResourceLoader suppressed by the Translate extension on Meta-Wiki pages - https://phabricator.wikimedia.org/T175261#3588858 (Nikerabbit) T... 
[16:16:32] Fundraising-Backlog, MediaWiki-ResourceLoader, MediaWiki-extensions-CentralNotice, Wikimedia-CentralNotice-Administration: ResourceLoader suppressed by the Translate extension on Meta-Wiki pages - https://phabricator.wikimedia.org/T175261#3588871 (Nikerabbit) [16:20:07] (Merged) jenkins-bot: [merged upstream] CRM-19612 Dedupe: dodge problems introduced by query union [wikimedia/fundraising/crm/civicrm] - https://gerrit.wikimedia.org/r/373176 (https://phabricator.wikimedia.org/T160571) (owner: Eileen) [16:24:29] Fundraising-Backlog, MediaWiki-ResourceLoader, MediaWiki-extensions-CentralNotice, Wikimedia-CentralNotice-Administration: ResourceLoader suppressed by the Translate extension on Meta-Wiki pages - https://phabricator.wikimedia.org/T175261#3588893 (Nemo_bis) Resolved by https://meta.wikimedia.org/... [16:27:27] (PS10) Ejegg: Capture Adyen payments missing pending messages [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/319489 (https://phabricator.wikimedia.org/T149861) [16:27:28] (PS1) Ejegg: Always clear all database statics in tearDown [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/376542 [16:28:27] (CR) jerkins-bot: [V: -1] Always clear all database statics in tearDown [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/376542 (owner: Ejegg) [16:28:33] (CR) jerkins-bot: [V: -1] Capture Adyen payments missing pending messages [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/319489 (https://phabricator.wikimedia.org/T149861) (owner: Ejegg) [16:30:00] (PS2) Ejegg: Always clear all database statics in tearDown [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/376542 [16:30:25] (PS11) Ejegg: Capture Adyen payments missing pending messages [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/319489 (https://phabricator.wikimedia.org/T149861) [16:33:06] Fundraising Sprint Quill Pencil, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, 
MediaWiki-extensions-Translate, and 2 others: WMDE banners failing to save - Timing out on save - https://phabricator.wikimedia.org/T170591#3588916 (AndyRussG) >>! In T170591#3587309, @Nemo_bis wrote: >>>! I... [16:34:44] (PS9) Ejegg: Use ct_id to find completed, avoid race [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/312563 (https://phabricator.wikimedia.org/T143945) [16:42:30] (PS1) Ejegg: CiviCRM submodule update [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/376544 [16:42:35] (CR) Ejegg: [C: 2] CiviCRM submodule update [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/376544 (owner: Ejegg) [16:44:30] (PS1) Ejegg: Merge branch 'master' into deployment [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/376545 [16:44:45] (CR) Ejegg: [C: 2] Merge branch 'master' into deployment [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/376545 (owner: Ejegg) [16:53:09] (Merged) jenkins-bot: CiviCRM submodule update [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/376544 (owner: Ejegg) [16:53:11] (Merged) jenkins-bot: Merge branch 'master' into deployment [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/376545 (owner: Ejegg) [17:17:32] Mornin XenoRyet [17:17:45] mornin' [17:21:18] Fundraising-Backlog, fundraising-tech-ops, Performance-Team, Wikimedia-General-or-Unknown, Performance: 2017 USA banners may freeze your browser for 1-2 seconds - https://phabricator.wikimedia.org/T174267#3589034 (DStrine) [17:24:08] Fundraising-Backlog, fundraising-tech-ops, MediaWiki-extensions-CentralNotice, Performance-Team, and 2 others: 2017 USA banners may freeze your browser for 1-2 seconds - https://phabricator.wikimedia.org/T174267#3589047 (DStrine) [17:32:15] XenoRyet: mind reviewing a couple little things? 
[17:32:21] Super trivial: https://gerrit.wikimedia.org/r/376450 [17:32:41] and just some testing stuff, but blocking something eileen already +2ed: [17:33:04] https://gerrit.wikimedia.org/r/375413 [17:34:24] Yea, I'll take a look [17:35:17] (CR) XenoRyet: [C: 2] Fix CurrencyRates class in exchange_rates module [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/376450 (owner: Ejegg) [17:35:31] thanks! [17:37:11] (CR) XenoRyet: [C: 2] Update RecurringGlobalCollectTest for static traits [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/375413 (owner: Ejegg) [17:37:24] No worries. Both pretty painless. [17:37:33] rockin, I'll roll those in with the crm deploy i was about to do [17:38:14] Good times [17:39:11] XenoRyet: was that email last night about the damaged messages intelligible? [17:39:40] Yea, made sense. [17:40:40] I was just working on getting that patch to pass tests again. I'm pretty sure it's just our test file for EC recurring is wrong, so I'm going to go find a new one to anonymize and make sure we're testing the right thing. 
[17:41:02] Fundraising-Backlog, Wikimedia-Fundraising-Banners, fundraising-tech-ops, MediaWiki-extensions-CentralNotice, and 3 others: 2017 USA banners may freeze your browser for 1-2 seconds - https://phabricator.wikimedia.org/T174267#3589119 (Pcoombe) [17:41:51] (Merged) jenkins-bot: Fix CurrencyRates class in exchange_rates module [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/376450 (owner: Ejegg) [17:45:00] (Merged) jenkins-bot: Update RecurringGlobalCollectTest for static traits [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/375413 (owner: Ejegg) [17:47:37] (Merged) jenkins-bot: Cancel subscription on red-flag declined [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/375449 (https://phabricator.wikimedia.org/T174450) (owner: Ejegg) [17:49:32] (PS1) Ejegg: Merge branch 'master' into deployment [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/376560 [17:49:36] (CR) Ejegg: [C: 2] Merge branch 'master' into deployment [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/376560 (owner: Ejegg) [17:50:36] (Merged) jenkins-bot: Merge branch 'master' into deployment [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/376560 (owner: Ejegg) [17:52:09] !log updated CiviCRM from c1ece1e0d97e6a4ea397bdb6c04f175bded0f4c4 to 63288ae8b3bac2f15804d4076c4ede55ba4a3759 [17:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:09] !log disabled queue consumers for data cleaning job [17:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:57] Fundraising-Backlog, Wikimedia-Fundraising-Banners, fundraising-tech-ops, MediaWiki-extensions-CentralNotice, and 4 others: 2017 USA banners may freeze your browser for 1-2 seconds - https://phabricator.wikimedia.org/T174267#3589197 (Jseddon) [17:59:25] cwd running a job that'll update a whole lot of contact and address rows [18:00:30] Fundraising Sprint Prank 
Seatbelt, Fundraising Sprint Quill Pencil, Fundraising-Backlog, Patch-For-Review, Unplanned-Sprint-Work: Deploy pt-br and ja thank you letter - https://phabricator.wikimedia.org/T173809#3589207 (Ejegg) Open>Resolved [18:01:21] Fundraising Sprint Quill Pencil, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Patch-For-Review: Scripts with a time limit should use real process elapsed time - https://phabricator.wikimedia.org/T174690#3589225 (Ejegg) Open>Resolved [18:05:29] Elegg, cool I am out for an hour or two but cc Jeff_green [18:26:29] Fundraising Sprint Prank Seatbelt, Fundraising Sprint Quill Pencil, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Patch-For-Review: civi dedupe: offer dedupe option in a regular search - https://phabricator.wikimedia.org/T151270#3589337 (Ejegg) Open>Resolved [18:28:00] Fundraising Sprint Kickstopper, Fundraising Sprint Loose Lego Carpeting, Fundraising Sprint Quill Pencil, Fundraising-Backlog, and 3 others: Deal with paypal EC error 11607, show thank you page - https://phabricator.wikimedia.org/T165635#3589339 (Ejegg) Open>Resolved [18:33:01] Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Fr-CiviCRM-dedupe-FY2017/18, Epic: Epic: Dedupe V2: resolve top conflicts - https://phabricator.wikimedia.org/T143057#3589355 (Ejegg) [18:33:03] Fundraising Sprint Quill Pencil, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Fr-CiviCRM-dedupe-FY2017/18, Patch-For-Review: Remove invalid City data from DB - https://phabricator.wikimedia.org/T174985#3589354 (Ejegg) Open>Resolved [18:36:07] Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Fr-CiviCRM-dedupe-FY2017/18, Epic: Epic: Dedupe V2: resolve top conflicts - https://phabricator.wikimedia.org/T143057#3589366 (Ejegg) [18:36:09] Fundraising Sprint Quill Pencil, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Fr-CiviCRM-dedupe-FY2017/18, Patch-For-Review: Data clean up - remove double spaces html ampersands from first & last names - 
https://phabricator.wikimedia.org/T175107#3589363 (Ejegg) Open>Resolved p:... [18:55:10] (PS5) XenoRyet: Fix PayPal Gateway Tagging [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/374913 (https://phabricator.wikimedia.org/T171351) [18:56:33] Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: Retry transaction after lock contention - https://phabricator.wikimedia.org/T111130#3589487 (Ejegg) [18:56:36] Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Patch-For-Review: Deadlock should result in requeueing the message - https://phabricator.wikimedia.org/T118487#3589490 (Ejegg) [18:57:07] Fundraising Sprint Quill Pencil, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Patch-For-Review: Deadlock should result in requeueing the message - https://phabricator.wikimedia.org/T118487#1801561 (Ejegg) a:Ejegg [18:59:00] (PS1) Ejegg: Retry deadlock / lock wait failures [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/376569 (https://phabricator.wikimedia.org/T118487) [19:00:17] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate-kafkatee.pyconf] [19:05:08] RECOVERY - check_puppetrun on americium is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:19:43] (PS4) Ejegg: Drop autoincrement ID on group_contact_cache [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/374588 (https://phabricator.wikimedia.org/T174404) [19:20:16] (PS5) Ejegg: Drop autoincrement ID and FKs on group_contact_cache [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/374588 (https://phabricator.wikimedia.org/T174404) [19:21:35] (CR) Ejegg: "Updated to drop the FK constraints too. I guess the only thing we're really losing is the ON DELETE CASCADE. 
And that shouldn't matter too" [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/374588 (https://phabricator.wikimedia.org/T174404) (owner: Ejegg) [19:25:11] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1212 [19:25:45] well, that's just super... [19:30:11] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1288 [19:35:11] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1367 [19:40:11] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1416 [19:42:51] fr-tech anything important to discuss at standup? [19:43:01] There's some hot food waiting for me... [19:43:16] I'll send an email update if there's nothing urgent [19:43:33] Nothing urgent. That patch is ready for review when you get a chance, but not worth missing hot food. 
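Editorial aside: the check_mysql alerts above all carry the same "Seconds Behind Master: N" field, so a log like this one can be reduced to host and lag pairs for a quick look at the trend. The following is a minimal sketch, assuming one alert per input line; the function names `lag_from_alerts` and `over_threshold` and the 1300-second cutoff are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Print "host lag_seconds" for each check_mysql CRITICAL alert read on stdin.
# Assumes one alert per line, in the format the monitoring bot uses in this channel.
lag_from_alerts() {
  sed -n 's/.*check_mysql on \([^ ]*\) is CRITICAL.*Seconds Behind Master: \([0-9][0-9]*\).*/\1 \2/p'
}

# Print the hosts whose lag exceeds the cutoff given as $1 (a hypothetical threshold).
over_threshold() {
  awk -v max="$1" '$2 > max { print $1 }'
}

# Two alerts copied from the log above.
printf '%s\n' \
  '[19:25:11] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1212' \
  '[19:40:11] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1416' \
  | lag_from_alerts
```

Piping the same output through `over_threshold 1300` would list only the hosts above that cutoff.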
[19:45:09] PROBLEM - check_mysql on frdb1002 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1433 [20:25:17] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 4915 [20:30:17] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 4276 [20:30:30] siiiiigh [20:30:47] well this will be some good data from prometheus anyway [20:32:20] why 2001 is the only one suffering i have no idea [20:33:36] (PS1) Ejegg: Add __toString method for PaymentTransactionResponse [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/376585 [20:33:51] cwd oh huh, 1002 was complaining an hour ago [20:34:49] yeah, it seems to have caught up all at once [20:35:08] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 543000 Threads: 1 Questions: 16029156 Slow queries: 4103 Opens: 6526 Flush tables: 1 Open tables: 606 Queries per second avg: 29.519 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [20:35:39] como eso [20:35:53] yet frdev1001 never flinched [20:37:41] huh, I guess we've never seen all the replicas lag at once [20:37:59] so we haven't actually been in danger of losing data [20:38:27] well during the english tests like a month ago they were all laggin [20:38:41] when the bad ones were getting to like 16k seconds behind [20:38:59] but i have never seen the lag be symmetrical between all of them [20:40:55] ah, ok [20:41:30] we also had all the jobs running then, but i am less and less convinced that it even matters [20:41:38] ejegg: oh you might know... [20:41:55] https://grafana.wikimedia.org/dashboard/db/frack-db?orgId=1&from=1504802501095&to=1504816901099 [20:42:11] That's today? [20:42:12] big dirty pages spike in the night, *without* an associated lag spike [20:42:18] yeah [20:42:24] i'm looking at the thing around 1800 [20:42:42] think that is a p-c job? 
[20:42:55] Was that the script I ran to clean up addresses and names? [20:43:18] that was 14:00 IRC time [20:43:19] oh yeah it could have been [20:43:22] sorry wrong one [20:43:28] yeah, totally was ten [20:43:29] then [20:43:50] a France pre-test started a half hour later [20:44:01] https://grafana.wikimedia.org/dashboard/db/frack-db?orgId=1&from=1504701822156&to=1504817022188 [20:44:20] so between 6 and 8 UTC (?) [20:44:36] guessing that's the silverpop export, lemme see [20:44:55] thing is [20:44:58] and i may be wrong [20:45:06] but i don't think read activity would increase dirty pages [20:45:36] well, it does drop and repopulate a bunch of tables in the silverpop db [20:45:46] aah ok, that makes sense then [20:45:50] oh, and I think it only does that on the replica! [20:46:08] interesting, master also spikes during that period [20:46:12] yeah, that's all the frdev stuff [20:46:23] Might be something else going on on master [20:46:51] in any case it is interesting that some heavy write activity doesn't cause lag [20:47:21] well, the silverpop job wouldn't be replicated anywhere since it's only on frdev [20:47:45] wonder why the master is thrashing at the same time [20:47:56] eileen and i did some testing last night [20:48:02] not sure if you read her email [20:48:17] yeah, sounds like those cache tables are likely culprits [20:48:19] but she hacked in some early returns so it was ignoring the group cache table stuff [20:48:31] it's definitely possible [20:48:46] but right as we were beginning to see an interesting trend it crashed with DB Error: Unknown Error [20:48:58] ah dang [20:49:06] which i have seen in some fail mails and eileen says it is when the importer chokes on a funky email address [20:49:38] but yeah it did not seem that replag was growing when the caching was off [20:50:28] there are just so many variables [20:50:38] but these prometheus graphs are pretty handy already [20:55:20] (PS6) Ejegg: Fix PayPal Gateway Tagging 
[wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/374913 (https://phabricator.wikimedia.org/T171351) (owner: XenoRyet) [20:56:45] (CR) Ejegg: [C: 2] "A definite improvement! We should also deal with reason_code" [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/374913 (https://phabricator.wikimedia.org/T171351) (owner: XenoRyet) [20:57:46] (Merged) jenkins-bot: Fix PayPal Gateway Tagging [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/374913 (https://phabricator.wikimedia.org/T171351) (owner: XenoRyet) [21:37:51] XenoRyet: not sure how to classify those two other reason_codes, but maybe mbeat would know [21:38:17] also, lmk if you want any help with cleaning up the damaged queue [21:38:40] Probably makes sense to script that out for review [21:43:14] ejegg: Yea, I don't know about those reason_codes either. We apparently haven't gotten any yet, but that doesn't mean we won't. [21:43:54] I read those two out of the damaged db [21:44:21] Oh did you? I swear I had searched for them at some point. [21:49:06] Fundraising Sprint Quill Pencil, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: Create lists of CiviCRM groups to allow MG & DS to review with a view to tidy up - https://phabricator.wikimedia.org/T174407#3589932 (Jgreen) [21:49:09] Fundraising Sprint Quill Pencil, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: Restore Live CiviCRM data to staging - https://phabricator.wikimedia.org/T174636#3589930 (Jgreen) Open>Resolved this was done as of Tuesday PM, sorry I forgot to comment on the task [21:50:24] ejegg: But yea, I could use a hand with the damaged stuff if you've got time. I was just about to click through the UI and clean those up. You wanna take a crack at the other part? [21:50:57] You mean the non-EC ones? [21:51:07] Yea, doing that update [21:52:14] OK, I'll give that a shot once I get to a good stopping spot in this minFraud stuff [21:52:34] Cool, thanks. 
[22:00:07] PROBLEM - check_puppetrun on americium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate-kafkatee.pyconf] [22:05:17] RECOVERY - check_puppetrun on americium is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [22:15:05] K4-713: the 'minfraud_log_mailer' is not a thing anymore, is it? [22:16:20] https://github.com/wikimedia/mediawiki-extensions-DonationInterface/blob/7ac8d07d1f5127bca6e0bdac4d7454196c87d0c5/extras/custom_filters/filters/minfraud/minfraud.body.php#L314 [22:16:44] just wondering if that log line is still as brittle as that comment suggests [22:16:46] ejegg: Oh man. [22:17:15] 'cause the new minfraud sdk doesn't want the params in that format any more [22:18:55] (PS1) XenoRyet: Merge branch 'master' into HEAD [wikimedia/fundraising/SmashPig] (deployment) - https://gerrit.wikimedia.org/r/376637 [22:19:42] (CR) XenoRyet: [C: 2] Merge branch 'master' into HEAD [wikimedia/fundraising/SmashPig] (deployment) - https://gerrit.wikimedia.org/r/376637 (owner: XenoRyet) [22:19:44] (CR) jerkins-bot: [V: -1] Merge branch 'master' into HEAD [wikimedia/fundraising/SmashPig] (deployment) - https://gerrit.wikimedia.org/r/376637 (owner: XenoRyet) [22:20:29] (CR) jerkins-bot: [V: -1] Merge branch 'master' into HEAD [wikimedia/fundraising/SmashPig] (deployment) - https://gerrit.wikimedia.org/r/376637 (owner: XenoRyet) [22:24:58] ejegg: Help me understand what's going on with that? ^ [22:25:44] XenoRyet: looks like composer.json has a merge conflict [22:26:05] It did, but I thought I fixed it. 
[22:26:16] https://gerrit.wikimedia.org/r/#/c/376637/1/composer.json [22:26:44] ohhh right, that's in red [22:28:03] d'oh, phpunit is failing b/c we've deleted the phpunit.xml on deployment [22:28:12] well shoot [22:28:26] we can take the 'phpunit' out of composer.json [22:28:47] but I don't like the thought of maintaining two different .json files [22:29:08] can we tell it to only run phpunit if phpunit.xml exists? [22:29:12] let me see [22:29:43] ejegg: Sorry, I'm all stuck in meetings until like 5pm. [22:30:15] I'm preeeeeetty sure that the fraud log mailer was killed years ago, but I can't really check. [22:30:22] People use fredge for that now. [22:30:27] That was pre-fredge technology. [22:32:19] k, I sure hope so! [22:50:07] (PS2) Ejegg: WIP upgrade to new Minfraud Composer package [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/375454 (https://phabricator.wikimedia.org/T128902) [22:52:14] (CR) jerkins-bot: [V: -1] WIP upgrade to new Minfraud Composer package [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/375454 (https://phabricator.wikimedia.org/T128902) (owner: Ejegg) [22:53:28] (PS3) Ejegg: WIP upgrade to new Minfraud Composer package [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/375454 (https://phabricator.wikimedia.org/T128902) [22:54:45] (PS1) Ejegg: Only run phpunit when the config file is present [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/376642 [22:55:07] XenoRyet: ^^^ should help, I think [22:55:44] I used [ ! 
-f phpunit.xml ] || [22:55:54] instead of [ -f phpunit.xml ] && [22:56:03] so that it would return true if the file doesn't exist [22:56:18] Right [22:57:17] (CR) XenoRyet: [C: 2] Only run phpunit when the config file is present [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/376642 (owner: Ejegg) [22:58:15] (Merged) jenkins-bot: Only run phpunit when the config file is present [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/376642 (owner: Ejegg) [22:58:22] heh, cool, it's still running tests in CI [22:59:27] ejegg: re: the mailer... you may want to check to see if the messages that get shoveled into fredge are similarly fragile, though... [23:00:05] Also: apologies. I hate having conversations with people that have an apparent 30-minute ping time. :/ [23:00:23] Today, I will do that to *everyone*. [23:00:57] Pretty sure the fredge messages for minfraud are just minfraud_filter and a score [23:01:02] so I left that alone [23:01:05] Rad. [23:01:24] I think you could... nay, should! remove the mailer references. [23:01:40] (Abandoned) XenoRyet: Merge branch 'master' into HEAD [wikimedia/fundraising/SmashPig] (deployment) - https://gerrit.wikimedia.org/r/376637 (owner: XenoRyet) [23:01:55] Jeff_Green: Around? 
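Editorial aside: the difference between the two guards discussed above is the exit status when phpunit.xml is absent. A minimal sketch, assuming a stand-in `run_tests` function in place of the real phpunit invocation; `guard_or`, `guard_and`, and `status_of` are hypothetical helper names used only here.

```shell
#!/bin/sh
# Hypothetical stand-in for the real test runner; it always succeeds.
run_tests() { echo "running tests"; }

# Form used in the patch: succeeds outright when the file is absent,
# otherwise runs the tests and returns their status.
guard_or() { [ ! -f "$1" ] || run_tests; }

# Alternative form: returns non-zero when the file is absent,
# which a calling script would treat as a failure.
guard_and() { [ -f "$1" ] && run_tests; }

# Print the exit status of a command without aborting the script.
status_of() { "$@" >/dev/null 2>&1 && echo 0 || echo $?; }

tmp=$(mktemp -d)
echo "or-form, file missing:  $(status_of guard_or "$tmp/phpunit.xml")"
echo "and-form, file missing: $(status_of guard_and "$tmp/phpunit.xml")"
touch "$tmp/phpunit.xml"
echo "or-form, file present:  $(status_of guard_or "$tmp/phpunit.xml")"
rm -r "$tmp"
```

With the file missing, the `||` form reports 0 and the `&&` form reports 1, which matches the reasoning given in the channel for preferring the former.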
[23:02:16] (PS4) Ejegg: WIP upgrade to new Minfraud Composer package [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/375454 (https://phabricator.wikimedia.org/T128902) [23:07:07] (PS1) XenoRyet: Merge branch 'master' into HEAD [wikimedia/fundraising/SmashPig] (deployment) - https://gerrit.wikimedia.org/r/376643 [23:08:28] (CR) XenoRyet: [C: 2] Merge branch 'master' into HEAD [wikimedia/fundraising/SmashPig] (deployment) - https://gerrit.wikimedia.org/r/376643 (owner: XenoRyet) [23:09:24] (Merged) jenkins-bot: Merge branch 'master' into HEAD [wikimedia/fundraising/SmashPig] (deployment) - https://gerrit.wikimedia.org/r/376643 (owner: XenoRyet) [23:10:44] (PS5) Ejegg: WIP upgrade to new Minfraud Composer package [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/375454 (https://phabricator.wikimedia.org/T128902) [23:15:17] !log Updated Smashpig from 8eb98c10ff1129ea65dd81559656001c819cc12f to dce4f0ac2238e25cbf00a51d4b81adde2532ddb8 [23:15:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:58] K4-713: now I am, what's up? [23:17:34] * Jeff_Green changing networks, back in a minute [23:17:40] Jeff_Green: Blast from the past: We killed the minfraud log mailer 1000 years ago, right? [23:20:13] Like, right about the time the ice started receding and we stood up fredge? [23:21:42] K4-713: think he missed the initial question while switching networks [23:21:47] oh yeah i did [23:22:03] XenoRyet: I'll take a crack at the damaged message db tomorrow. [23:22:16] 10-4 [23:22:24] in the meantime, I'd love some feedback on the MinFraud update before I go much further with it [23:22:32] https://gerrit.wikimedia.org/r/375454 [23:22:45] Sure, I'll take a bit of a look before it's quitting time. [23:22:53] great! [23:23:11] You got another patch coming for the other reason codes? 
[23:26:41] (PS6) Ejegg: WIP upgrade to new Minfraud Composer package [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/375454 (https://phabricator.wikimedia.org/T128902) [23:27:19] i'll keep an eye out for that [23:27:26] anyway, have a good night, all! [23:28:20] K4-713: what were you saying about fredge? [23:49:07] Agh, there goes my 30-minute ping time again. [23:49:12] Terrible. [23:49:21] Ah well.