[00:05:10] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1219 [00:07:21] eileen: doesn't look like the same job as lately? [00:08:54] cwd oh [00:08:59] what can we tell about that? [00:09:48] I've run something from the command line just now that is updating a smaller number of rows - perhaps 30000 in total [00:10:06] it could be that - I didn't think it would cause replag [00:10:08] maybe that's it, i am not seeing the same queries as i have been [00:10:11] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1455 [00:10:24] although, if it's only frdev then is there something else in play? [00:10:54] are the queries you are seeing the 'cause' or the symptom? [00:11:11] good question [00:11:18] i don't see anything long running [00:11:26] Nothing is running now I don't think [00:11:42] i am seeing a bunch of "update civicrm_contact" [00:11:59] on the server being sync'ed to? [00:12:14] interesingly it says 1455 seconds behind master [00:12:35] I don't think there was even 1455 seconds of queries that ran [00:13:31] anyway - if it was that I have finished with that & the cron job for that only does 500 rows every half hour [00:14:42] yeah that seems like a weirdly long time [00:15:00] sometimes on staging the delay can be caused by a query on staging [00:15:08] yeah [00:15:10] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1681 [00:15:18] but i don't see anything [00:15:19] how did it just go up? [00:15:21] all short queries [00:15:27] that's what i'm wondering [00:15:51] so the job that I just ran takes the unsubscribe data & actions unsubscribes in CiviCRM [00:16:10] there are about 100k to do - but that's only because we have never done it before [00:16:46] it took about 10 minutes to run 10k of them (which I tried in a batch & doing that twice obviously triggered this) [00:16:57] the cron is set to 500 per run [00:17:04] (so a lot less than 10k) [00:17:18] and once caught up I expect there would only be a few per run [00:17:51] so this particular job won't be an issue - but there is still a gap in our understanding about replication & replication lag I feel [00:18:07] still, seems weird that 100k queries would punish the db like this [00:18:17] considering the nature of the queries of course [00:19:03] right - & actually it's not even that many - because I ran about 20k of them although perhaps 3-4 queries each so maybe it gets there [00:19:25] what i am seeing a ton of is this "closing tables" operation [00:20:10] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1916 [00:20:11] https://dba.stackexchange.com/questions/21446/msql-state-closing-tables-taking-double-the-time-of-a-process [00:23:13] cwd that link has some things to check [00:23:33] yep [00:25:11] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2733 [00:25:12] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2095 [00:25:58] well there's 2001 so it's not just frdev [00:26:34] i wonder if we could switch to row based replication [00:27:03] if this has to do with table caching for queries, it seems like not running the queries is the most basic solution [00:27:43] are the queries still the same ones? 
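The checks being run here boil down to roughly the following on each replica — a sketch, assuming the mysql client picks up credentials locally (e.g. from ~/.my.cnf):
    # replica lag and replication-thread state, as reported by the replica itself
    mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
    # what is actually executing right now; look for long-running statements or 'closing tables'
    mysql -e "SHOW FULL PROCESSLIST" | grep -v -w 'Sleep'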
[00:28:18] now i am seeing civicrm_mailing_provider_data [00:28:25] and a whole lot of "closing tables" [00:29:43] hmm - that closing tables is weird [00:30:10] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2305 [00:30:20] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2966 [00:30:33] | Opened_tables | 32936 | [00:30:47] number is big (compared to the example) but unchanging [00:31:20] The task that ran for the mailing_provider_data load started at 10 past & ended just under 4 minutes later [00:31:55] the job that ran before the first alert was a little longer but still not that long [00:32:44] " [00:32:44] table_open_cache is the setting that is most likely to get the host off your back. [00:32:44] " [00:33:24] hmm opened tables is a counter - not currently open [00:33:30] https://dba.stackexchange.com/questions/133694/mysql-large-number-of-opened-tables-what-does-this-mean-and-how-can-i-reduce [00:33:41] yeah [00:33:51] i don't get what that means with regards to the cache [00:33:57] we don't have 32k tables [00:34:43] so the fact that it's not changing would imply that everything is cached [00:34:56] reading this one https://dev.mysql.com/doc/refman/5.7/en/table-cache.html atm - seems to help [00:35:10] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2505 [00:35:10] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3179 [00:36:04] i am tempted to try `flush tables` on frdev [00:37:45] yeah - what would we record before & after to see that the change is? [00:38:24] eileen: the 2 dbs that are complaining have 32k+ table cache entries, the one that's not (frdb1002) has only 6k [00:38:41] that sounds like a likely culprit [00:38:44] full table cache [00:38:54] not flushing properly for some reason? [00:39:06] yeah - maybe [00:39:16] 32k sounds like a setting [00:39:33] is this the same on all ? [00:39:34] show variables LIKE '%open%';; [00:40:10] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2701 [00:40:10] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3424 [00:40:27] yep [00:41:13] it looks like there is an OS setting for how many files can be open at once & that needs to be sync with the table_open_cache value [00:43:14] "The reason you didn't find an answer is that there's not an answer." [00:43:20] that's some good stack overflow right there [00:43:24] :-) [00:43:44] I'm def leaning towards trying that flush command as I read this [00:44:08] eileen: do you know if this is temp table heavy by any chance? [00:44:22] wonder if that might cause problems with table cache [00:44:44] it shouldn't be temp table heavy [00:45:10] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2885 [00:45:11] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3591 [00:46:11] cwd I think the action that started this is not inconsistent with the sort of thing we do during Big english - it was intense for this time of year - but it's not in the big picture [00:46:39] do you mean vs doing a bulk import? 
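The table-cache numbers being compared above come from commands along these lines (a sketch; Opened_tables is a cumulative counter since startup, Open_tables is the current count, and the Table_open_cache_* counters only exist on newer servers):
    mysql -e "SHOW GLOBAL STATUS LIKE 'Open%tables'"         # Open_tables (now) vs Opened_tables (counter)
    mysql -e "SHOW VARIABLES LIKE 'table_open_cache'"        # configured cache size
    mysql -e "SHOW VARIABLES LIKE 'open_files_limit'"        # file-descriptor ceiling that must cover the cache
    mysql -e "SHOW GLOBAL STATUS LIKE 'Table_open_cache_%'"  # hit/miss/overflow counters, if available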
[00:46:56] I mean the query that started this was not part of the big import [00:47:11] it was a pretty standard 'process incoming stuff' job [00:47:21] (albeit using the big import's data) [00:47:42] yeah i am getting the gradual feeling that we are hitting a performance wall that needs a more systemic solution [00:48:19] yeah - I can see the big import as an anomalous load [00:49:06] but, that process was a bit long - but I'm sure we've run jobs for that long last year [00:49:36] yep agreed [00:50:10] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3808 [00:50:13] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3073 [00:51:15] my worry about `flush tables` is does it terminate queries [00:51:19] the recovery might be slowed by the fact the bigger job kicked in again (it took 3 minutes at 10 past but it looks like it started again at 40 past & that one is a bit bigger & is still running [00:51:20] it says it closes all open tables [00:55:10] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3247 [00:55:10] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 4038 [00:59:22] eileen: i am not seeing a p-c job running [01:00:03] is it a lot of fast jobs? [01:00:10] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 4212 [01:00:10] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3480 [01:00:22] no - the process-control job that started at 40 past is still running [01:00:56] doh wrong computer [01:01:03] 10 past only took 3 minutes & there was a longer one earlier today that didn't cause problems [01:01:39] it definitely seemed like the more 'business as usual' job triggered this - but that job is not helping with the recovery [01:01:52] dang yeah there are 8 jobs running [01:02:15] how do you see them - just ps-ef or is there a special function [01:02:21] eileen: which job is the 'business as usual" one? [01:02:22] yep just that [01:02:43] Jeff_Green: so I ran something from the command line earlier that triggered this [01:02:55] hi Jeff_Green -- sorry to activate you, it seems like mysql is drowning [01:02:56] it was a bit longer running than our normal crons [01:03:02] what job was it? 
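The job check referred to above ("just ps-ef") is plain process inspection — for example (the grep pattern is a guess at how the process-control jobs appear in the process table):
    ps -ef | grep -i '[p]rocess-control'   # the [p] keeps the grep itself out of the results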
[01:03:14] it was a command - it finished about an hour ago [01:03:20] & took under 10 mins [01:03:28] but it got behind off that [01:03:45] Jeff_Green: this command: SHOW GLOBAL STATUS LIKE 'Opened_tables'; [01:03:53] this will recover & I don't need to do anythiing else that will cause probs [01:04:03] shows a much larger number on the complaining boxes than the quiet one (frdb1002) [01:04:14] but, I propose that we plan a time to stop all the queues & then start them all at once [01:04:29] & simulate some more big english like load [01:05:10] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3710 [01:05:10] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 4437 [01:05:36] Jeff_Green: i was thinking of trying `flush tables` on one of the overloaded slaves [01:05:58] but i am not sure if it will terminate queries and cause a replication problem [01:05:58] what we were seeing was a lot of 'closing tables' queries [01:06:31] PROBLEM - check_minfraud_secondary on payments1004 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [01:06:32] PROBLEM - check_minfraud_secondary on payments1002 is CRITICAL: CRITICAL - Socket timeout after 61 seconds [01:06:43] to be clear the query I ran was not one that would be scheduled to run like that - it's just the nature of it seemed very similar to our normal big english jobs [01:07:53] if flush tables worked it might indicate a mysql setting that needs tweaking [01:08:22] cwd: how did you arrive at flushing tables as a thing to try? [01:08:56] let me check the browser tabs [01:09:32] Jeff_Green: https://dev.mysql.com/doc/refman/5.7/en/table-cache.html [01:10:11] RECOVERY - check_minfraud_secondary on payments1002 is OK: HTTP OK: Status line output matched 200,301,302 - 385 bytes in 0.245 second response time [01:10:12] RECOVERY - check_minfraud_secondary on payments1004 is OK: HTTP OK: Status line output matched 200,301,302 - 385 bytes in 0.246 second response time [01:10:13] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3932 [01:10:13] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 4659 [01:11:41] Jeff_Green: that started as an observation that show processlist was full of 'closing tables' & looking up what things play into that [01:11:48] i'm a little puzzled about the open tables thing considering the relatively high limits in the config [01:12:17] i.e. 
frdev1001 reports 1185 tables open, with open_tables_cache =3000 [01:12:18] it could be a red herring, if 1002 was restarted more recently [01:12:22] since it's just a counter [01:13:02] i mean I'm not opposed to trying it, i'd opt for frdev [01:13:34] like eileen said we were seeing a ton of 'closing tables' queries [01:13:37] query load is very low right now, so I don't know what's causing the reported delay [01:13:41] that seemed to be taking a long time [01:13:58] it says that is closes tables when the cache is full, when it exceeds cache_open_tables & when flush tables is called https://dev.mysql.com/doc/refman/5.7/en/table-cache.html [01:14:05] in fact if you look now you will still see them [01:14:11] (not sure I understand yet which cache) [01:14:15] yeah I saw them [01:14:41] random internet posts say they should not take an appreciable amount of time [01:15:08] eileen: right, but I don't understand why we'd be hitting that limit, we have a total of ~1020 tables open or not, an the cache is 3000 [01:15:10] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 4182 [01:15:20] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 4869 [01:15:51] Jeff_Green: I think a table can be open more than once "For example, for 200 concurrent running connections, specify a table cache size of at least 200 * N, where N is the maximum number of tables per join in any of the queries which you execute. You must also reserve some extra file descriptors for temporary tables and files." [01:15:56] https://dev.mysql.com/doc/refman/5.7/en/table-cache.html [01:16:23] although that still doesn't seem to add up [01:16:28] yeah [01:16:40] this would be a lot easier with metrics [01:16:52] all of the queries i've seen have been simple and fast [01:18:18] mysql is really rocking some processors [01:18:20] imo we should focus on two things tomorrow--getting ganglia back to reporting mysql stats, and talk to the DBAs [01:18:32] yeah [01:18:42] yes dbas [01:18:52] Jeff_Green: i did some light reading on mixed replication [01:19:00] the thing that confuses me is the lag is so incredibly much higher than the time taken by the queries in the first place [01:19:10] cwd that is an oxymoron :-) [01:19:16] it sounds like it defaults to statement based except in a few relatively unlikely circumstances [01:19:20] :) [01:19:24] touche! 
[01:19:29] eileen: yeah, there's something insane going on with that reporting [01:19:46] times when it calculates something non-deterministic, like involving machine time [01:19:54] interesting [01:20:06] so it is probably a lot slower than row based [01:20:10] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 5071 [01:20:10] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 4399 [01:21:15] i'd like to find out what we might lose by switching to that [01:22:06] I'd be interested to see if we can replicate creating the lag in a controlled effort at a better time of day (for you) [01:23:03] eileen: yeah, perhaps with one of the DBAs helping us look at db activity [01:23:16] since it seems that the query I ran from the command line was not that big & didn't touch that table that we have been hammering - but I guess we could prove it by turning all else off when doing it [01:23:33] yeah having db time seems good [01:23:48] how confident are you that it was this one query? [01:24:16] not 100% [01:24:35] i have not seen any long running queries tonight [01:24:38] but it seemed like I ran it once & all was good. Twice & frdev seemed to get behind [01:24:51] but i did see 8 concurrent p-c jobs for quite awhile [01:24:53] I was a process that basically iterates through incoming changes [01:24:59] it was I mean [01:25:06] & processes them via the civi api [01:25:10] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 4602 [01:25:11] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 5263 [01:25:22] and that is pretty much what our main processes do [01:25:27] still going up, that is too weird [01:25:32] yeah! [01:26:06] at least half the time i see 'closing tables' [01:26:11] hmm [01:26:13] yet the cpus are on fire [01:26:18] we should enable performance_schema [01:26:20] well not on fire but working hard [01:28:04] Jeff_Green: well icould try the flush tables thing [01:28:11] i really don't understand why it's backing up [01:28:39] there are no long running jobs [01:29:10] right & it is 84 minutes behind in processing about 25 minutes of queries [01:30:10] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 4792 [01:30:20] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 5448 [01:32:55] my current theory is that mysql is doing something internally that doesn't show up in processlist [01:33:23] but afaict we have to enable performance_schema to get more information, and that requires a server restart [01:33:29] Am I seeing this right? [01:33:31] 4947 | closing tables [01:33:43] ie. it implies it has been closing tables for 4947 seconds! [01:33:59] that's frdev? [01:34:10] yep [01:34:54] maybe it really wants to close the table i.e. to refactor the index, but the rate of queries prevents that? 
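For reference, the performance_schema change discussed above is a one-line server setting plus a restart — a minimal sketch, assuming a Debian-style /etc/mysql/conf.d/ layout and a service named mysql (in production it was pushed out via config management):
    printf '[mysqld]\nperformance_schema = ON\n' | sudo tee /etc/mysql/conf.d/perfschema.cnf
    sudo systemctl restart mysql
    mysql -e "SHOW VARIABLES LIKE 'performance_schema'"   # should now report ON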
[01:35:09] i just stopped slave [01:35:10] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 5005 [01:35:11] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 5636 [01:35:41] well that didn't help :-) [01:36:44] just looking that closing_tables state query has lots of mysql commands running within it [01:36:52] which is weirder & weirder [01:37:06] but I guess it holds that connection until finished [01:37:55] so the whole query-set (however it groups them) holds a connection & then releases all at once - so the count of how far behind it is is misleading because it will perhaps catch up all at once [01:39:48] what connection do you mean? i thought it was just running what's in the bin log at this point [01:40:10] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3892 [01:40:10] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 5864 [01:40:33] Jeff_Green: guessing you stopped slave on frdev1001? [01:41:06] yeah, but i restarted it [01:41:57] just cause that one went down a little [01:42:12] which seems to imply that stuff is still coming from the maste [01:42:13] r [01:42:21] oh that's definitely true [01:42:22] but i am not seeing much p-c activity [01:42:52] so what is it doing? [01:43:05] mytop says it's doing ~1qps [01:43:28] that does not sound like very much [01:43:39] it's doing a whole lot of insert ignore activity [01:44:29] ok yeah [01:44:38] those are those same queries that have been bogging it down [01:44:43] but that job is not running [01:44:51] civicrm_mailing_provider_data [01:45:08] what else is running? [01:45:09] right - it finished about 20 minutes ago but is still syncing [01:45:10] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3710 [01:45:11] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 6093 [01:45:14] ok [01:45:38] i think mysql is all jammed up [01:47:00] there seems to be a correlation between the time on this process 11472 | system user | | NULL | Connect | 3692 | closing tables [01:47:09] and the time ichinga is telling us about [01:47:33] yes [01:47:43] that is the id of the process i keep seeing [01:48:02] so that is a sync process I guess [01:49:11] what do you mean sync process? [01:49:33] " If this is system user, it refers to a nonclient thread spawned by the server to handle tasks internally. This could be the I/O or SQL thread used on replication slaves or a delayed-row handler." 
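The replication-thread handling mentioned here amounts to the following (STOP/START SLAVE only pauses the IO and SQL threads; the relay log position is kept, nothing is discarded):
    mysql -e "STOP SLAVE"                              # pause replication on the replica
    mysql -e "START SLAVE"                             # resume; it continues from the saved position
    mysql -e "SHOW PROCESSLIST" | grep 'system user'   # the replication threads appear as 'system user'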
[01:49:43] sync as in replication [01:50:10] PROBLEM - check_mysql on frdev1001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 3767 [01:50:10] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 6312 [01:52:25] same long running thread has the id 3 on the other server [01:52:27] 2001 [01:53:22] i'm going to restart mysql on frdev1001 to enable the performance schema engine [01:53:39] random corrolation [01:53:54] cool [01:53:55] seconds behind master looks like the same number as the age of the 'closing tables' thread [01:54:10] i guess we'll also see if that stops it thrashing on restart [01:54:32] yep [01:55:10] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 6555 [01:55:20] PROBLEM - check_mysql on frdev1001 is CRITICAL: Cant connect to local MySQL server through socket /var/run/mysqld/mysqld.sock (2) [01:55:37] ah well that really broke staging :-) [01:55:44] sorry :-) [01:57:35] if i'm reading that right it looks like it caught up already [01:57:41] yep [01:57:47] yeah - what magic [01:58:05] and now it has a whole lot of activity metrics [01:58:23] I keep refreshing to see if it's suddenly gonna realise it has a tonne of work to do & fall behind again… [01:58:34] heh yea [01:58:55] I mean - otherwise the catch up seems weird & indicative of … something [01:59:26] some sort of building state [01:59:45] yeah or something got flushed in the restart in a beneficial way [02:00:10] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 6766 [02:00:14] RECOVERY - check_mysql on frdev1001 is OK: Uptime: 284 Threads: 5 Questions: 159619 Slow queries: 5 Opens: 289 Flush tables: 1 Open tables: 219 Queries per second avg: 562.038 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [02:01:11] check it [02:01:27] frdb1002 has been running since 7/31, not broken [02:01:42] frdb2001 has been running since 7/11, broken [02:01:46] ah the mysql proc i mean [02:01:55] frdev just restarted and it's ok now [02:02:45] the number of rows in that table are the same on frdev & live [02:02:58] seems to imply something about the local mysql server getting filled up [02:03:06] so it's not like it suddenly decided to ditch the rows - it's like the restart unblocked it [02:03:24] Jeff_Green: mind if i try flush tables on 2001? [02:03:31] unless you already fixed it [02:04:12] go for it [02:04:19] worst case we can restart mysql [02:04:22] yup [02:04:56] eh, did not affect the closing tables thread [02:05:20] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 6965 [02:07:37] Jeff_Green: what if i try killing that thread? [02:07:59] I think restarting the service would be safer [02:08:04] since it worked on staging [02:08:13] yeah [02:08:20] i would just like to know what about restarting fixes it [02:08:25] yeah you don't want to kill the replication thread [02:08:44] I feel like it's a cache that is full & slowing it down - I just can't find something that backs that up [02:09:19] cwd stopping slave stops that thread though [02:09:33] oh ok [02:09:46] and that did not solve the problem [02:10:05] so you restarted it? 
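The flush-versus-restart attempts above look roughly like this (FLUSH TABLES closes all open tables and blocks new opens until it completes, so it can stall running queries; the service name is an assumption):
    mysql -e "FLUSH TABLES"                                        # tried on frdb2001, no visible effect
    sudo systemctl restart mysql                                   # what actually cleared the backlog on frdev1001
    mysql -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master    # confirm it drops back to 0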
[02:10:10] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 7164 [02:10:14] i didn't touch frdb2001 [02:10:22] no me neither [02:10:22] ah ok [02:10:32] but i will if you want [02:10:36] I'm pushing out a config change to enable performance_schema there too, one sec [02:10:40] cool [02:10:42] on staging I did count(*) on the table it was filling after the restart & it matched prod [02:11:02] eileen: like it caught up in no time? [02:11:13] yep [02:11:24] that is what it looked like to me too [02:11:40] got to be a cache type setting [02:14:12] ok i'm going to restart frdb2001 [02:14:52] cool - do you know how behind it is now (before you do it) [02:15:10] PROBLEM - check_mysql on frdb2001 is CRITICAL: Cant connect to local MySQL server through socket /var/run/mysqld/mysqld.sock (2) [02:15:13] last i saw it was a little over 7K sec [02:15:30] ok [02:17:37] it's busy creating temporary tables now [02:18:04] it's like a character in clash of clans [02:18:56] so the side-effect of these restarts is performance data is now collected - but there is another step to make it ganglia visible? [02:19:59] this one does not seem to be catching up in the same hasty fashion [02:20:08] could be network [02:20:10] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 5506 [02:20:20] yeah, there's a ganglia collector that was working a while back, and it stopped, unrelated to the performance_schema thing, I haven't had a chance to figure out why [02:20:27] it is going down anyway [02:20:35] yeah that's what I was thinking [02:20:38] still degraded raid on that one [02:20:42] after only going up for the last hour [02:21:20] i have a meeting with filippo in the morning to chat about the new monitoring system [02:21:31] i will mention mysql, it's not part of the base stats [02:24:36] cool, they probably have an exporter thinggis for prod by now [02:25:10] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 581 Threads: 3 Questions: 323021 Slow queries: 3 Opens: 442 Flush tables: 1 Open tables: 218 Queries per second avg: 555.974 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [02:26:02] so actually that one did recovery pretty rapidly after the restart too [02:26:24] yep [02:26:28] at least we have a fix [02:26:31] ccccccgdtfcjeivrdfdbfdvuhckekvtbvdefclhgligg [02:26:33] opps [02:26:47] I feel like that tells us something too [02:26:53] just not sure what [02:27:32] could be a bug [02:28:03] mysql thrashing bugs have happened before [02:28:26] i wonder if it's been this all along [02:28:32] the last week or however long [02:28:54] you mean the servers have been tired & in need of a wee refresh [02:29:05] exactly [02:29:26] a little of the old systemctl restart [02:29:44] I'd be surprised if the raid issue has no impact [02:30:15] we'll see if it comes back quickly or not [02:32:01] 'closing tables' could also be a red herring since i saw a bunch of it as 2001 was catching up [02:32:16] yeah [02:32:34] it could be a different cache or buffer that we haven't thought of [02:32:55] yep [02:33:11] sounds like one for a dba [02:34:08] if it is actually a built up state problem we will probably have to wait for it to happen again before it is obvious what is wrong [02:34:21] true [02:36:16] I guess I'll re-enable the schedule of that job, I reduced the time grab, but it seems to have been fine in isolation - it was just the other thing that tipped it over [02:36:58] yeah go for 
it [02:36:59] !log update process-control to 3d3978ad86e077911e558b986d0635caf2 [02:37:01] i'll be around ish for a bit [02:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:37:09] let's see if it falls over right away [02:37:23] ok - it will run at 40 past [02:37:48] although since I reduced to 12 hours it probably only really does 'something' every 2 runs [02:38:33] because of donor sleep cycle? [02:38:38] is it regional? [02:39:09] well I think that most of the entries are straight after Caitlin sends [02:39:18] at least half are 'SENT' entries [02:39:48] & then about 80% of the rest are 'OPEN' and obviously she sends at the time she thinks people will open [02:41:16] ah makes sense [02:41:28] it's working on the data now - it's 3.7 MB csv [02:41:56] which is not big in terms of the ones that have hit the db [02:42:06] but, larger than the average half-hourly one [02:45:46] 5 mins in & no slow-warning [02:46:59] the job has completed (took just over 5 mins) [02:48:11] cool, no lag at all [02:48:11] cwd no sign of delay on frdev Slave has read all relay log; waiting for the slave I/O thread to update it [02:48:23] yep, 0 seconds behind [02:48:36] is it the same on frdb? [02:48:42] yep [02:48:48] even the dallas one [02:49:03] seems like the restart unclogged it [02:49:11] I wonder if I should run again via process-control since we are watching [02:49:19] do you have a bunch of data to push still? [02:49:24] go for it [02:51:07] ok - I kicked it in again to see - if this is fine then I think it will probably tick along OK [02:52:01] I don't have anything more I want to try to 'hurry up' so I guess we should try to plan for some load testing after wikimania to make sure there isn't somethink lurking to hurt us [02:52:46] woo, i may have arrived at the bottom of the ganglia rabbithole, but now I'm tired [02:52:51] * Jeff_Green heading out [02:55:34] cwd it has just pulled a larger file - so if this doesn't kill it nothing will [02:56:28] although now I'm not sure what the plan is if it does start to go slow [02:56:43] * cwd rubs hands [02:56:55] since I don't want to disturb Jeff again [02:56:58] i can restart mysql if necessary [02:57:04] ok [02:58:38] this one is 33MB [02:58:54] (must have been a big mail out on that day) [03:00:29] rock and roll [03:01:02] well 5 minutes in & no error on irc [03:01:55] so, this is what I was wondering about - after 5 minutes the slave is still waiting [03:02:25] eileen: yeah it seems like they are staying caught up [03:02:46] which is what i'd expect generally [03:03:20] it's also disproved a theory that I'd been brewing [03:04:00] what was that? [03:04:04] (ie. the because the throttling was at the php level it was not replicating until the php thread ended -) [03:04:28] but the fact the number of rows in the civicrm_mailing_provider_data table is growing on both seems to disprove that [03:05:50] mmm yeah i guess that would depend on how the mysql adapter works? [03:06:21] yeah - but if that were an issue we would see replication not happening until the end of the process & that isn't the case [03:06:27] it's just that it is keeping up now [03:06:42] yeah [03:06:52] does civi use PDO or something custom? 
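A sketch of the sort of watching being done here while the throttled job runs — the interval is arbitrary, and the database name in the row-count check is a guess:
    watch -n 30 'mysql -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master'
    # compare row counts on master vs replica for the table the job is filling
    mysql -e "SELECT COUNT(*) FROM civicrm.civicrm_mailing_provider_data"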
[03:08:14] PDO - [03:08:31] I looked at the code & it is not keeping the transaction open [03:10:48] it's like staging is just sitting around waiting now [03:11:42] yep no lag on 2001 either [03:13:24] it's interesting - I thought I'd see some evidence of slow down & speed up with the throttling [03:13:38] but it's just staying at 0 [03:14:38] that is consistent with degrading db performance [03:15:01] you mean degrading prior to the restart? [03:15:19] yeah [03:18:58] the job that is running will only process 5000 every 60 seconds [03:19:51] and it doesn't look like it is getting away at all so far - but perhaps over time it slips [03:20:24] could be [03:20:30] could be a long time though [03:20:50] yeah [03:21:02] this would have to be at the longer end [03:22:13] knowing I can watch the length of that process on frdev rather than waiting for ichinga helps [03:31:36] still no lag [03:32:13] no looks pretty ok huh? [03:32:51] yep no problem [03:46:34] it finished [04:42:34] (PS1) Eileen: Add drush via composer. [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/369593 [04:50:33] (PS2) Eileen: Add drush via composer. [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/369593 [04:51:07] Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Wikimedia-Fundraising-CiviCRM, Patch-For-Review, Unplanned-Sprint-Work: increase email limit from 50 to 700 in civi - https://phabricator.wikimedia.org/T170900#3491997 (Eileenmcnaughton) [04:58:19] (PS3) Eileen: Limit Silverpop group import to Opt In by default [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/368125 [04:58:31] (CR) Eileen: Limit Silverpop group import to Opt In by default (1 comment) [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/368125 (owner: Eileen) [08:13:20] Fundraising-Backlog, MediaWiki-extensions-CentralNotice, MediaWiki-extensions-Translate, Performance-Team, WMDE-Fundraising-CN: WMDE banners failing to save - Timing out on save - https://phabricator.wikimedia.org/T170591#3492148 (Nikerabbit) @AndyRussG @ksmith @ggellerman @Amire80 @Arrbee @K... [10:10:40] l [11:57:26] fundraising-tech-ops, Operations, netops: bonded/redundant network connections for fundraising hosts - https://phabricator.wikimedia.org/T171962#3492728 (mark) No objections from me. It does add complexity somewhat and will probably add some failure modes where both ports fail to come up (LACP failin... [12:39:58] Wikimedia-Fundraising-Campaigns (FY Q1 Jul-Sep 2017), Wikimedia-CentralNotice-Administration (Q3-2017): WMF Fundraising 2017-18 - Israel (Jul 5 - Aug 2) - https://phabricator.wikimedia.org/T165526#3492926 (Pcoombe) Open>Resolved [13:01:59] Good morning MBeat. This week's Big English test just went up [13:02:12] great news, thanks pcoombe :) [14:03:48] Test is over, all looked good from here. [14:08:02] MBeat: I did notice we still have no BPay donations in civi from the test on 12th July. Have we seen any donor comments about that? Do we have someone local who could test? 
[14:11:02] thanks, pcoombe - that’s strange that none of them settled [14:11:17] no donor complaints, [14:11:30] will ponder on who we could ask to test [14:40:17] (PS1) Ejegg: Update SmashPig library [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/369665 [14:40:26] (CR) Ejegg: [C: 2] Update SmashPig library [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/369665 (owner: Ejegg) [14:44:12] (PS35) Ejegg: Add Ingenico Connect gateway based on GlobalCollect [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/364143 (https://phabricator.wikimedia.org/T163946) [14:44:35] Mepps, how's the connectivity where you are? [14:45:06] also, how would you feel about merging the first Ingenico Connect patch in DonationInterface? [14:45:20] I don't think it's doing anything that will harm other gateways [14:45:39] or has any glaring security flaws [14:46:39] (Merged) jenkins-bot: Update SmashPig library [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/369665 (owner: Ejegg) [15:00:10] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1337 [15:05:10] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 46181 Threads: 2 Questions: 5348156 Slow queries: 486 Opens: 1792 Flush tables: 1 Open tables: 590 Queries per second avg: 115.808 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [15:53:22] (PS5) Ejegg: Ingenico WX audit parsing [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/368948 (https://phabricator.wikimedia.org/T86090) [16:10:46] Wikimedia-Fundraising-CiviCRM, Continuous-Integration-Infrastructure, Patch-For-Review, Release-Engineering-Team (Kanban): wikimedia-fundraising-civicrm fails with Call to a member function getDriver() on null in phar:///srv/jenkins-workspace/worksp... - https://phabricator.wikimedia.org/T171724#3493781 [16:37:14] (PS6) Ejegg: Ingenico WX audit parsing [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/368948 (https://phabricator.wikimedia.org/T86090) [16:53:45] ejegg I'll take a look, connection is currently spotty but working on a fix, I've been working offline [16:57:42] thanks mepps! [16:57:53] hope you find a good signal [17:04:07] ejegg so the config knows that if it finds a path to another yaml file to open that file? [17:04:40] mepps you mean the symlinks? [17:04:48] yeah [17:05:02] it'll open symlinked files just like any other app [17:05:12] oh i get it [17:05:22] that that's a symlink, in gerrit that's now it appears [17:05:35] ah yeah, the path is not the contents of the file [17:06:13] *not how it appears [17:06:18] got it [17:08:28] Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: Spike: should thank you mail send be a queue consumer? 
- https://phabricator.wikimedia.org/T172300#3494014 (Ejegg) [17:08:31] (CR) Mepps: Add Ingenico Connect gateway based on GlobalCollect (2 comments) [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/364143 (https://phabricator.wikimedia.org/T163946) (owner: Ejegg) [17:08:51] only comment there needs a response ejegg ^^ [17:09:17] thanks mepps, looking [17:11:08] (CR) Ejegg: Add Ingenico Connect gateway based on GlobalCollect (1 comment) [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/364143 (https://phabricator.wikimedia.org/T163946) (owner: Ejegg) [17:13:46] also ejegg we won't need the orphan adapter for ingenic but it's not a big deal for this patch [17:13:54] ingenico [17:14:04] (CR) Mepps: [C: 2] Add Ingenico Connect gateway based on GlobalCollect [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/364143 (https://phabricator.wikimedia.org/T163946) (owner: Ejegg) [17:14:10] oh yeah, I'll zap that stuff [17:14:17] (the orphan stuff) [17:14:25] thanks for the CR! [17:17:36] (Merged) jenkins-bot: Add Ingenico Connect gateway based on GlobalCollect [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/364143 (https://phabricator.wikimedia.org/T163946) (owner: Ejegg) [17:25:13] RECOVERY - check_raid on frdb2001 is OK: OK: HPSA [P420i/slot0: OK, log_1: 3.3TB,RAID1+0 OK] [17:25:19] fr-tech any news for scrum of scrums? [17:28:51] Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: CiviCRM dedupe jobs should gracefully time out - https://phabricator.wikimedia.org/T172303#3494111 (Ejegg) [18:04:54] Fundraising-Backlog, Spike: spike: New email risk scoring enhancements - https://phabricator.wikimedia.org/T172310#3494319 (DStrine) [18:49:13] (PS7) Ejegg: Ingenico WX audit parsing [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/368948 (https://phabricator.wikimedia.org/T86090) [18:56:59] (PS14) Ejegg: [WIP] Ingenico WX audit module [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/199653 (https://phabricator.wikimedia.org/T86090) (owner: Katie Horn) [18:57:16] ^^^ Really close to done [18:58:39] ejegg: Really? [18:58:43] (CR) jerkins-bot: [V: -1] [WIP] Ingenico WX audit module [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/199653 (https://phabricator.wikimedia.org/T86090) (owner: Katie Horn) [18:58:53] That's been hanging over me for... years. :/ [18:59:11] I was so close before I got sucked into the management vortex. [19:00:02] I expect people could study the beginnings of that WIP to measure effects of the management event horizon. [19:06:33] hehe [19:07:27] K4-713: there was finally a good reason to expedite it - they apparently stopped adding manually-approved txns to the wr1 files a few months back [19:10:37] Fundraising-Backlog, fundraising-tech-ops: Can't access CiviCRM - https://phabricator.wikimedia.org/T172233#3494626 (DKaufman) I am using Chrome. As for OS, I have V 10.11.6. This has been happening since I've been remote -- starting last week. In case it's useful, I also cannot connect to the Wikimedia... [19:16:31] Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Wikimedia-Fundraising-CiviCRM, Patch-For-Review, Unplanned-Sprint-Work: increase email limit from 50 to 700 in civi - https://phabricator.wikimedia.org/T170900#3494635 (DKaufman) Hi, I can't test it until I am able to acce... 
[19:22:38] (PS1) Ejegg: Merge branch 'master' into deployment [wikimedia/fundraising/SmashPig] (deployment) - https://gerrit.wikimedia.org/r/369715 [19:22:40] (CR) Ejegg: [C: 2] Merge branch 'master' into deployment [wikimedia/fundraising/SmashPig] (deployment) - https://gerrit.wikimedia.org/r/369715 (owner: Ejegg) [19:23:33] Fundraising Sprint Navel Warfare, Fundraising-Backlog, FR-Email, Patch-For-Review: Leave blank values in Silverpop file blank, not NULL - https://phabricator.wikimedia.org/T91692#3494645 (Ejegg) Open>Resolved [19:23:42] (Merged) jenkins-bot: Merge branch 'master' into deployment [wikimedia/fundraising/SmashPig] (deployment) - https://gerrit.wikimedia.org/r/369715 (owner: Ejegg) [19:23:48] Ejegg still working on setting up a better wifi solution, can you call me into the retro? [19:33:22] mepps: what's your number? wanna send it via gchat? [19:34:41] Fundraising Sprint Navel Warfare, Fundraising-Backlog, MediaWiki-extensions-DonationInterface, MW-1.30-release-notes (WMF-deploy-2017-08-01_(1.30.0-wmf.12)), Patch-For-Review: Compound order IDs should never end in ".0" - https://phabricator.wikimedia.org/T171891#3494709 (Ejegg) Open>R... [19:34:46] Fundraising Sprint Navel Warfare, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, FR-Email, Patch-For-Review: Import email-only contacts from 'remind me later' links into CiviCRM - https://phabricator.wikimedia.org/T160949#3494711 (Ejegg) [19:34:48] Fundraising Sprint Navel Warfare, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, FR-Email, Patch-For-Review: Make sure silverpop export does not add default date for last_donation date - https://phabricator.wikimedia.org/T171920#3494710 (Ejegg) Open>Resolved [19:34:56] Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, and 2 others: "Thank-you Sent" field auto populating date - https://phabricator.wikimedia.org/T170440#3494712 (Ejegg) Open>Resolved [19:35:05] Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Fundraising-Backlog, FR-PayPal-ExpressCheckout, and 2 others: Stop using kludgy workaround for PayPal EC recurring - https://phabricator.wikimedia.org/T170478#3494714 (Ejegg) Open>Resolved [19:35:20] Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Wikimedia-Fundraising-CiviCRM, Patch-For-Review, Unplanned-Sprint-Work: increase email limit from 50 to 700 in civi - https://phabricator.wikimedia.org/T170900#3494715 (Ejegg) Open>Resolved [19:35:37] Fundraising Sprint Kickstopper, Fundraising Sprint Loose Lego Carpeting, Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, and 4 others: PayPal EC dead session error - https://phabricator.wikimedia.org/T167923#3494716 (Ejegg) Open>Resolved [19:40:40] Fundraising Sprint Navel Warfare, Fundraising-Backlog, FR-PayPal-ExpressCheckout, Patch-For-Review: Paypal EC subscr start messages not recording amount - https://phabricator.wikimedia.org/T171546#3494737 (Ejegg) Open>Resolved p:Triage>Normal [19:47:31] Fundraising Sprint Navel Warfare, Fundraising-Backlog, Patch-For-Review, Unplanned-Sprint-Work: Are we losing transactions witih repeated ct_id? - https://phabricator.wikimedia.org/T171349#3494766 (Ejegg) I still see zero things tagged with 'DuplicateInvoiceId', but that might be because we legit... 
[20:10:24] Fundraising Sprint Navel Warfare, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, FR-Email, Patch-For-Review: Import email-only contacts from 'remind me later' links into CiviCRM - https://phabricator.wikimedia.org/T160949#3494867 (Eileenmcnaughton) [20:20:17] Fundraising Sprint Navel Warfare, Fundraising Sprint O 2017, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, MediaWiki-extensions-DonationInterface: Create orphan rectifier for PayPal Express Checkout - https://phabricator.wikimedia.org/T172202#3494917 (DStrine) [20:20:20] Fundraising Sprint O 2017, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, FR-Ingenico: Deal with recurring donations stuck in 'In Progress' status - https://phabricator.wikimedia.org/T171868#3494919 (DStrine) [20:20:23] Fundraising Sprint Navel Warfare, Fundraising Sprint O 2017, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Patch-For-Review: CentralNotice: JS selection widgets no longer work in interface to add a campaign - https://phabricator.wikimedia.org/T172023#3494918 (DStrine) [20:20:27] Fundraising Sprint Navel Warfare, Fundraising Sprint O 2017, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, Patch-For-Review: Populate country column when creating c_t rows during offline import - https://phabricator.wikimedia.org/T171658#3494920 (DStrine) [20:20:30] Fundraising Sprint Navel Warfare, Fundraising Sprint O 2017, Fundraising-Backlog, Patch-For-Review: Strict error message in logs (quick tidy up) - https://phabricator.wikimedia.org/T171560#3494921 (DStrine) [20:20:32] Fundraising Sprint Gondwanaland Reunification Engine, Fundraising Sprint Homebrew Hadron Collider, Fundraising Sprint Ivory Tower Defense Games, Fundraising Sprint Judgement Suspenders, and 9 others: Drush not handling spaces in quotes / schedule Si... - https://phabricator.wikimedia.org/T171435#3494922 [20:20:34] Fundraising Sprint Navel Warfare, Fundraising Sprint O 2017, Fundraising-Backlog, FR-PayPal-ExpressCheckout, and 2 others: Are PayPal refunds for recurring donations incorrectly being tagged as EC or vice versa? - https://phabricator.wikimedia.org/T171351#3494923 (DStrine) [20:20:36] Fundraising Sprint Navel Warfare, Fundraising Sprint O 2017, Fundraising-Backlog, Patch-For-Review, Unplanned-Sprint-Work: Are we losing transactions witih repeated ct_id? - https://phabricator.wikimedia.org/T171349#3494924 (DStrine) [20:22:08] Fundraising Sprint Navel Warfare, Fundraising Sprint O 2017, Fundraising-Backlog, Fr-Ingenico-integration_17_18: Customize Ingenico hosted checkouts iframe - https://phabricator.wikimedia.org/T171346#3494925 (DStrine) [20:22:10] Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Fundraising Sprint O 2017, Fundraising-Backlog, and 3 others: Optimize filesystem access in thank_you email job - https://phabricator.wikimedia.org/T170435#3494927 (DStrine) [20:22:12] Fundraising Sprint Loose Lego Carpeting, Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Fundraising Sprint O 2017, and 3 others: CentralNotice: Add controls to purge banner content in Varnish for a specific language - https://phabricator.wikimedia.org/T168673#3494929 (DStrin... 
[20:22:14] Fundraising Sprint Kickstopper, Fundraising Sprint Loose Lego Carpeting, Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, and 5 others: Resultswitchers: send straight to ty page on reload - https://phabricator.wikimedia.org/T167990#3494930 (DStrine) [20:22:16] Fundraising Sprint Kickstopper, Fundraising Sprint Loose Lego Carpeting, Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, and 3 others: process-control should not crash on bad utf-8 from stdout or stderr - https://phabricator.wikimedia.org/T167849#3494931 (DStrine) [20:22:19] Fundraising Sprint O 2017, Fundraising-Backlog, MediaWiki-extensions-DonationInterface: Update Mastercard logo - https://phabricator.wikimedia.org/T166795#3494932 (DStrine) [20:22:21] Fundraising Sprint Judgement Suspenders, Fundraising Sprint Kickstopper, Fundraising Sprint Loose Lego Carpeting, Fundraising Sprint Murphy's Lawyer, and 3 others: Compile list of systems we depend on and should check all the time - https://phabricator.wikimedia.org/T163509#3494936 (DStrine) [20:22:23] Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Fundraising Sprint O 2017, Fundraising-Backlog, and 4 others: Implement Ingenico Connect API calls to get payment status - https://phabricator.wikimedia.org/T163948#3494933 (DStrine) [20:22:26] Fundraising Sprint Ivory Tower Defense Games, Fundraising Sprint Judgement Suspenders, Fundraising Sprint Kickstopper, Fundraising Sprint Loose Lego Carpeting, and 8 others: Move already-encapsulated useful classes into SmashPig - https://phabricator.wikimedia.org/T163868#3494935 (DStrine) [20:22:28] Fundraising Sprint Loose Lego Carpeting, Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Fundraising Sprint O 2017, and 5 others: Implement Ingenico Connect API call to create hosted payment - https://phabricator.wikimedia.org/T163946#3494934 (DStrine) [20:22:30] Fundraising Sprint O 2017, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM: Allow use of Civimails - https://phabricator.wikimedia.org/T162747#3494937 (DStrine) [20:22:32] Fundraising Sprint Navel Warfare, Fundraising Sprint O 2017, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, and 2 others: Import email-only contacts from 'remind me later' links into CiviCRM - https://phabricator.wikimedia.org/T160949#3494939 (DStrine) [20:22:34] Fundraising Sprint Judgement Suspenders, Fundraising Sprint Kickstopper, Fundraising Sprint Navel Warfare, Fundraising Sprint O 2017, and 5 others: retrieve lists of contacts who received a particular mailing - https://phabricator.wikimedia.org/T161762#3494938 (DStrine) [20:22:37] Fundraising Sprint Navel Warfare, Fundraising Sprint O 2017, Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, and 4 others: Merge CRM and DonationInterface queue wrappers - https://phabricator.wikimedia.org/T95647#3494941 (DStrine) [20:22:42] Fundraising Sprint Baudelaire Bowdlerizer, Fundraising Sprint Costlier Alternative, Fundraising Sprint Deferential Equations, Fundraising Sprint English Cuisine, and 15 others: Purge Varnish cache when a banner is saved - https://phabricator.wikimedia.org/T154954#3494940 (DStrine) [20:22:45] Fundraising Sprint Murphy's Lawyer, Fundraising Sprint Navel Warfare, Fundraising Sprint O 2017, Fundraising-Backlog, and 2 others: Clone button for CN campaigns - https://phabricator.wikimedia.org/T91078#3494942 (DStrine) [20:22:48] Fundraising Sprint Lou Reed, Fundraising Sprint Navel Warfare, Fundraising Sprint O 2017, Fundraising-Backlog, and 8 others: 
Update GC audit to read WX file - https://phabricator.wikimedia.org/T86090#3494943 (DStrine) [20:41:07] Fundraising Sprint Judgement Suspenders, Fundraising Sprint Kickstopper, Fundraising Sprint Loose Lego Carpeting, Fundraising Sprint Murphy's Lawyer, and 2 others: Compile list of systems we depend on and should check all the time - https://phabricator.wikimedia.org/T163509#3495002 (Ejegg) Ope... [20:57:24] (PS15) Ejegg: Ingenico WX audit module [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/199653 (https://phabricator.wikimedia.org/T86090) (owner: Katie Horn) [20:57:42] K4-713: removed the WIP ^^^ [20:57:52] Aaaaaagh! [20:57:57] it'll fail in CI, but that's just 'cause it depends on the SmashPig patch [20:58:18] This is a momentous day. [20:58:46] if you feel like doing some CR for old times' sake: https://gerrit.wikimedia.org/r/368948 [20:58:53] (CR) jerkins-bot: [V: -1] Ingenico WX audit module [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/199653 (https://phabricator.wikimedia.org/T86090) (owner: Katie Horn) [20:59:59] Oh, I don't know that I'd be usefully able to +2 anything, but this should be informative. [21:00:20] hmm, those SmashPig audit parsers should have an interface [21:01:04] then the CRM tests could just mock 'em and be more unit-y [21:03:19] Oooh. [21:16:01] as much as I hate to replace python code with PHP, I feel like the PayPal audit parser belongs there too [21:16:33] (PS1) Ejegg: Add AuditParser interface, other audit cleanup [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/369748 [21:19:05] ooh, bumping drush from v5 to v8 [21:19:13] guess that merits some local testing :) [21:20:39] re: python-> PHP: Yeah, I mean, eventually we're going to have to be consistent if we want anyone else to be able to use this stuff. [21:20:50] ...and still want to talk to us. [21:20:53] hehe [21:24:20] ejegg: how are you going to use the new drush? hard code in the job? [21:25:39] cwd yeah, just replace the /usr/local/bin/drush, I guess [21:25:55] unless you want to symlink it someplace for convenience! [21:26:12] We're deploying to a predictable location [21:26:28] it'll just be /vendor/bin/drush [21:26:38] ejegg: yeah, i think i should probably flip that wrapper script [21:26:45] cool [21:27:04] I'll at least try running through all the scheduled jobs locally first [21:27:51] sounds good [21:28:53] Jeff_Green: fr tech wishes to switch to composer drush, it is more up to date that apt (surprise), seems sensible to me as it is basically part of civi [21:29:14] are you cool with that? 
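For context, the composer route being proposed amounts to roughly the following (the version constraint is an assumption; the actual change is the 'Add drush via composer' patch above):
    composer require 'drush/drush:^8'   # pulls drush into vendor/, alongside the other civi dependencies
    vendor/bin/drush --version          # the updated drush, vs the distro-packaged 5.10 the wrapper calls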
[21:29:31] possibly, but possibly not [21:30:11] need to ponder the implications, since drush is directly wrapped by sudo and running with a fair amount of privileges [21:30:51] we need to solve the problem of an audit mechanism for composer installed stuff [21:31:09] that is true [21:31:22] it runs as www-data which is a lot of stuff [21:31:29] yup [21:31:39] using composer at all bothers me [21:31:53] or i should say doesn't sit well [21:32:08] package management is hard [21:32:25] let's start with a phabricator task for the discussion [21:32:28] yeah, would be nice to have the packages signed anyway [21:32:29] in an age when you are not expected to flinch at curl github.com/project | sh [21:33:01] there's a phab task here: https://phabricator.wikimedia.org/T171435 [21:33:50] I wonder if there's a newer version in backports [21:34:07] nope :-( [21:37:24] well we can always package it [21:37:36] i would like to spend some time streamlining my process for debs [21:39:04] is it just a flat php file? [21:39:32] it is actually a huge package [21:39:41] ugh [21:40:26] there is not even a candidate in buster [21:40:52] ok. let's put this on the agenda for tomorrow [21:40:54] $ dpkg -L drush | wc -l [21:40:55] 128 [21:41:11] (CR) Ejegg: [C: 2] "Runs our schedule commands locally with no issues. Merging this now and we can update the wrapper script in production when ops are satisf" [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/369593 (owner: Eileen) [21:41:37] ejegg: so wait, you are going to run drush as not www-data now? [21:41:44] it would be good to come up with a scheme we can reuse for similar cases, and think about how we're using composer and whether it's a complete nightmare to stop using it [21:42:23] is this just b/c of our drush wrapper? [21:42:26] cwd no, just merging the package to get drush8 in vendor/bin/ for us all to use in dev. Not switching anything in prod yet [21:42:35] ah right on [21:42:57] does the wrapper eat the quotes? [21:43:12] good question [21:43:20] has anyone tried running it without the wrapper? [21:43:31] (CR) Ejegg: [C: 2] Limit Silverpop group import to Opt In by default [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/368125 (owner: Eileen) [21:43:44] Note that the key benefit of upgrading drush IS for the process-control to use the upgraded one [21:44:23] Jeff_Green: could the drush wrapper rather than the drush version be the reason it doesn' cope with spaces within quotes? [21:44:46] ie drush vset name="look more than one word" [21:44:50] i think that is what he is theorizing [21:44:57] is treated as drush vset name="look" [21:44:59] is that easy to test? [21:45:04] that's what I'm wondering, yeah [21:45:05] yep [21:45:24] you can just run it via sudo instead [21:46:00] I don't think I can (or should) do sudo [21:46:18] that's all the wrapper does [21:46:37] are you running this on dev or civi? [21:46:55] on civi [21:46:57] so locally [21:46:58] drush vset name "two words" [21:46:59] name was set to "two words". [21:47:11] (Merged) jenkins-bot: Add drush via composer. [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/369593 (owner: Eileen) [21:47:21] ok so that's not it, that's good to know [21:47:49] On staging [21:47:50] drush vset name "two words" [21:47:51] name was set to "two". 
[21:48:06] (same on prod ) [21:48:32] !log updated SmashPig from f4ca53ca303a7c9cbc4f0fd15d62116f0689b81d to c501f53aa0394a8f99f4b9fd87b3c8b8294511e0 [21:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:00] now I'm confused [21:49:41] (Merged) jenkins-bot: Limit Silverpop group import to Opt In by default [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/368125 (owner: Eileen) [21:49:49] I'm always confused, so welcome to my world :-) [21:49:58] can you re-summarize what happens where? [21:50:01] wat [21:50:20] OK - so correct normal behaviour for drush (which works locally & a bunch of other places) [21:50:27] from drupal dir [21:50:28] are you using the wrapper locally? [21:50:29] drush vset name "two words" [21:50:41] cwd no - not using the wrapper - just using drush [21:50:48] ah so the wrapper must be it [21:50:59] well also locally I'm using a recent drush version [21:51:08] the one on dev / prod is really old [21:51:12] like years [21:51:33] drush version 5.10.0 [21:51:50] locally I have 7 [21:52:13] but anyway the good thing is it's easy to test on staging [21:52:15] eileen: try running the script directly on staging? [21:52:25] you can sudo that, it's a whitelisted command [21:52:31] as in find the path to the actual drush command? [21:52:31] the wrapper script is just convenience [21:52:41] cat /usr/local/bin/drush [21:52:45] sudo -u www-data /usr/share/php [21:52:56] /drush/drush.php [21:53:39] hp /usr/share/php/drush/drush.php vset name "two words" [21:53:39] name was set to "two words". [21:53:46] so that worked [21:54:14] something tells me this is going to get fiddly [21:54:20] what is the $@ at the end of the line? [21:54:37] is that something to do with passing arguments? [21:54:48] yeah it's all your arguments [21:54:53] in bash [21:55:09] so the command is sudo -u www-data /usr/share/php/drush/drush.php $@ [21:55:17] I ran it with php first [21:55:37] removing php didn't change it [21:55:56] but I wonder if that $@ is passing the args in a way that doesn't parse quotes the same [21:56:04] it's got to be something to do with how $@ expands [21:57:21] yeah it's a bash snakepit alright [21:57:39] cwd I think this is it https://stackoverflow.com/questions/3811345/how-to-pass-all-arguments-passed-to-my-bash-script-to-a-function-of-mine?answertab=votes#tab-top [21:58:01] (maybe) [21:59:26] so i think that will quote the args [21:59:32] but they are already quoted right? [21:59:52] do you do drush vget to see what you set? [21:59:59] i'm going to punt for the night, i gotta make dinner [22:00:06] cool, have a good one Jeff_Green [22:00:12] you too! [22:01:34] (PS1) Ejegg: Fix 'Only variables should be passed by reference' [wikimedia/fundraising/SmashPig] - https://gerrit.wikimedia.org/r/369803 (https://phabricator.wikimedia.org/T171560) [22:03:47] cwd sorry dropped out due to puter issue - back [22:04:34] the args are quoted as I pass them in - but the quotes are mishandled by the time it reaches drush [22:04:54] yeah... 
let me see if i can find the right incantation [22:05:40] ok - if you want to test you just need to run that command from the drupal dir & see the response [22:05:55] gotcha, ty [22:19:12] eileen: trying to test locally, puppet catastrophe [22:19:25] cwd sigh [22:24:50] eileen: "$@" did seem to work for "two words" [22:24:55] cwd cool [22:25:03] i'll edit the wrapper to have that [22:25:12] well [22:25:20] it would be nice to test that it doesn't blow up any jobs [22:25:50] cwd - yeah - I don't know what to do other than deploy & monitor [22:39:50] eileen: ok the quotes are live [22:39:56] sorry that took so long [22:41:11] cwd cool [22:41:29] works on staging [22:43:25] cwd looks all good - do you prefer we now revert adding drush into composer? [22:43:43] yeah it sounded like jeff is not really down [22:44:01] but it's not about updating drush, just composer being ridiculous from a security standpoint [22:44:03] (PS3) Ejegg: Update DonationInterface lib [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/368848 (https://phabricator.wikimedia.org/T95647) [22:44:21] oh, so I shouldn't have rebased that patch on top of yours? [22:44:27] heh, no worries [22:44:42] I was a bit quick on the trigger with the +2 there [22:45:04] there is no reason packaging our own stuff should be hard [22:45:27] ejegg: so not quite a revert- but I'll repatch now. BTW I still get that composer timestamp flapping & I'm on latest [22:45:37] jeez, that's weird [22:45:44] except that making debian packages is terrible [22:45:45] is it a mac thing? [22:45:52] I could only find that one comment about the timestamp format & someone said 'yeah but it's not in the changelog' [22:45:57] maybe it is mac [22:46:09] well, I'mma head out here [22:46:17] ok [22:46:56] that WX parsing is all set for review, btw - Civi patch just depends on the SmashPig patch merging [22:47:39] so it's easy for us to manage installing debs because it's the paradigm [22:47:52] but from what i can tell that's the only real benefit [22:48:32] apt is great because of the debian web of trust but our own debs do not benefit from that [22:53:13] although maybe there's an argument against making it easier to use less vetted code [23:22:00] eileen: i am going to head out shortly, anything you can predict needing? [23:22:09] cwd no - I think we are good! [23:22:21] I'm not going to be changing what is running today [23:22:26] super [23:22:32] & there have been no hiccups since server restart [23:22:56] yep [23:23:13] a fix i don't understand is better than no fix :) [23:23:52] right - I raised it today at sprint planning that we should try to break it again in a more controlled way after wikimania [23:31:01] sounds good
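For reference, the wrapper fix that went live above comes down to quoting the argument expansion — a minimal sketch of /usr/local/bin/drush after the change (the shebang is an assumption):
    #!/bin/sh
    # unquoted $@ re-splits arguments on whitespace, so  vset name "two words"  arrived as  vset name two words;
    # quoting it preserves each argument exactly as passed
    sudo -u www-data /usr/share/php/drush/drush.php "$@"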