[02:23:03] Wikimedia-Fundraising, Mobile-Content-Service, Wikipedia-Android-App-Backlog, Wikipedia-iOS-App-Backlog, and 2 others: Run Big English fundraising on apps - https://phabricator.wikimedia.org/T181004#3839355 (bearND)
[02:23:27] Wikimedia-Fundraising, Mobile-Content-Service, Wikipedia-Android-App-Backlog, Wikipedia-iOS-App-Backlog, and 2 others: Run Big English fundraising on apps - https://phabricator.wikimedia.org/T181004#3776010 (bearND) Done, see T182802.
[15:22:42] fr-tech I want to use these updates to log non-200 responses from the Amazon SDK: https://github.com/ejegg/login-and-pay-with-amazon-sdk-php/tree/debugpauses
[15:22:49] Anybody see any problems?
[15:23:03] for background: we're already using a fork of the SDK
[15:23:16] that adds a reporting client
[15:23:42] since most of the underlying post / decode logic is the same between payments and reporting calls
[15:23:42] that link takes me to the readme
[15:23:50] ejegg
[15:24:04] jgleeson: ah, it's the last 2 commits on that branch
[15:24:16] https://github.com/ejegg/login-and-pay-with-amazon-sdk-php/commits/debugpauses
[15:28:23] looks good to me. I read the conditional in invokePost as "if not 200, 500, or 503" then log
[15:30:40] although I now see that the 500/503 code path triggers a perpetual retry which eventually results in a 200 or non-200 result
[15:34:05] jgleeson: wait, really?
[15:34:48] ah, no, pauseOnRetry actually throws an exception when we get past MAX_RETRYS
[15:35:19] ejegg i see a whitespace issue in BaseClient.php line 50, though not as big a deal
[15:35:33] so many whitespace issues...
[15:47:13] ejegg, sorry, yes, it does max out within pauseOnRetry
[15:55:10] ok, I'll merge those to my fork's dev-master and pull that update into DonationInterface
[16:04:59] mepps I keep trying to get WS fixes merged upstream as a precursor to upstreaming the ReportClient stuff
[16:05:02] https://github.com/amzn/amazon-pay-sdk-php/pull/63/files
[16:05:16] but they're super-unresponsive
[16:05:16] (PS1) Jgleeson: Updated composer package stats-collector to version 1.0.0 [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/398495
[16:07:44] oh hey, dude actually commented on the PR that time!
[16:08:16] * ejegg crosses fingers
[16:08:32] ejegg nice, also see that you introduced one too in line 50 though
[16:09:13] ah, yeah, I'll have to completely redo the reportClient stuff for 3.2.0 before I try upstreaming it
[16:09:57] just waiting on the whitespace standardization first, so I can submit the rest as intelligible chunks
[16:11:26] the logger stuff is actually already implemented upstream
[16:11:59] (though annoyingly, they've just copied Psr\Log right into their lib instead of listing it as a dependency)
[16:15:11] oh weird
[16:21:29] hi AndyRussG :)
[16:23:23] jgleeson: morning! :)
[16:25:03] (PS1) Ejegg: Update Amazon SDK fork for logging retries [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/398500 (https://phabricator.wikimedia.org/T182735)
[16:25:08] just logging the Amazon, don't be alarmed ^^^
[16:25:29] i am having the weirdest time with dash, ejegg: for some reason it's not registering my code changes
[16:26:26] mepps are you running it as an actual service, or just using 'node server.js -d' from the command line each time?
[16:26:47] node server.js, but i restart it when i make code changes
[16:27:10] ooh i finally got the error i was looking for!
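The exchange above describes the fork's retry behavior: invokePost logs non-200 responses, while 500/503 responses go through pauseOnRetry, which backs off and finally throws once MAX_RETRYS is exceeded. A minimal PHP sketch of that flow, assuming only the names invokePost, pauseOnRetry, MAX_RETRYS, and BaseClient from the chat; the httpPost helper, logging call, and backoff timings are hypothetical:

```php
<?php
// Sketch only: the real SDK fork may differ in signatures and delays.
class BaseClient
{
    const MAX_RETRYS = 3;

    private function invokePost($url, array $params)
    {
        $retries = 0;
        while (true) {
            list($status, $body) = $this->httpPost($url, $params);
            if ($status === 200) {
                return $body;
            }
            if ($status !== 500 && $status !== 503) {
                // the debugpauses branch adds logging for these responses;
                // error_log stands in for the fork's real logger
                error_log("Amazon returned HTTP $status: " . substr((string) $body, 0, 200));
                return $body;
            }
            // 500/503: pause and retry, bounded by MAX_RETRYS
            $this->pauseOnRetry(++$retries, $status);
        }
    }

    private function pauseOnRetry($retries, $status)
    {
        if ($retries > self::MAX_RETRYS) {
            // this is the exception ejegg points out at 15:34:48
            throw new \Exception("Reached maximum retries for HTTP $status");
        }
        // exponential backoff: 0.1s, 0.2s, 0.4s
        usleep((int) (pow(2, $retries) * 50000));
    }

    // hypothetical transport helper, included so the sketch is self-contained
    private function httpPost($url, array $params)
    {
        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_POST => true,
            CURLOPT_POSTFIELDS => http_build_query($params),
        ));
        $body = curl_exec($ch);
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
        return array($status, $body);
    }
}
```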
phew
[16:39:31] okay ejegg i just had to find where data.js was used
[17:04:51] fr-tech sorry, net issues
[17:09:38] fr-tech, i caught the internet flu too now!
[17:09:40] and you're not even in the country that just sold out to telecoms
[17:09:50] oops, meaning ejegg not mepps :)
[17:27:00] fundraising-tech-ops, monitoring, Epic: overhaul fundraising cluster monitoring - https://phabricator.wikimedia.org/T91508#3841020 (cwdent)
[18:15:31] !log rolled back civicrm to 798e2467 to investigate prometheus bug
[18:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:06] Fundraising-Backlog, fundraising-tech-ops: add primary keys or unique indexes to some tables in civicrm, drupal, and pgehres databases - https://phabricator.wikimedia.org/T176631#3841274 (Jgreen) p:Triage>Low Clearly this was not the cause of the replication lag issues that originally prompted me...
[18:32:41] Fundraising-Backlog, fundraising-tech-ops, Operations: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#3841291 (Jgreen)
[18:32:43] Fundraising-Backlog, fundraising-tech-ops, Operations, Spike: Spike: Enumerate remaining unported stats - https://phabricator.wikimedia.org/T175850#3841288 (Jgreen) Open>Resolved a:Jgreen We have fundraising grafana dashboards now that cover the stuff we care about.
[18:37:22] ?
[18:38:10] [redacted]
[19:14:29] mepps, jgleeson_ hi!
[19:14:45] hi cwd!
[19:15:01] hi cwd :)
[19:16:05] how goes? we're seeing some interesting prometheus behavior that might be php related
[19:16:34] jgleeson you said you were looking at something related to this?
[19:16:44] yep
[19:17:11] https://grafana.wikimedia.org/dashboard/db/fundraising-host-overview?panelId=21&fullscreen&orgId=1&var-server=civi1001.frack.eqiad.wmnet&var-datasource=frack.codfw%20prometheus&from=now-6h&to=now&refresh=5m
[19:17:36] looks like some stuff has not been reporting lately... haven't nailed down what
[19:17:37] about an hour ago I rolled back a release from last night that was creating erroneous duplicate metric data
[19:17:41] cwd
[19:17:55] ah ha, would that have been the donations.prom file?
[19:17:58] I think it was to do with that
[19:18:00] yes
[19:18:04] the new addition
[19:18:23] great, we blew the dir away and it came back to life, so we were wondering if it was something like that
[19:18:27] thanks and carry on!
[19:18:45] we added a bunch of new averages and overalls, but the individual stats were also still output
[19:19:16] annoyingly it doesn't fail when prometheus scrapes that data directly as a target
[19:19:31] so I am guessing the node exporter is failing less gracefully
[19:19:51] which might explain the lack of data entirely for the period
[19:21:17] my local testing consists of a local install of Prometheus scraping the output text files served by a local PHP server. It seemed to work using that, but as it's not a true reflection of production, I suspect the node exporter gap is where it's failing
[19:22:23] huh, i was assuming it was the server scrape that failed
[19:22:27] cause i saw the file on disk
[19:22:40] Jeff_Green noticed it had some repeating keys
[19:24:19] so the local scrape still works for me
[19:24:38] but there is a warning in the console log indicating the duplicate key
[19:25:26] as in, new data is scraped correctly for non-duplicated keys
[19:25:29] what console is that? i am trying to find an error message
[19:25:34] like stdout?
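For reference, this is a hypothetical illustration (not the real donations.prom) of the duplicate-key problem described above, in the Prometheus text exposition format. A series is the metric name plus its label set, and each series may appear only once per scrape:

```
# Hypothetical donations.prom contents; names and values are made up.
donations_total{gateway="adyen"} 1012
donations_rate_avg 3.4
# the next line repeats an existing series (same name and labels),
# which is what adding overalls alongside the old individual stats
# can produce, and what the collector complains about:
donations_total{gateway="adyen"} 1012
```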
[19:25:39] yeah
[19:25:46] running ./prometheus
[19:25:49] with a config.file
[19:25:58] hrm, bet cron eats it
[19:26:41] is the node exporter consuming the .prom files and serving them itself?
[19:26:53] or wait, it is its own cron
[19:28:06] jgleeson_: in prod the load balancers are running the service that scrapes the servers
[19:28:10] gets those .prom files over http
[19:28:47] grafana.wm.o polls pay-lvs for the data to graph
[19:32:00] it looks like the prometheus node exporter uses a textfile collector to suck up the .prom files and make them available at the instance's /metrics url
[19:33:10] I'm guessing it's failing somewhere around there when a duplicate key is encountered
[19:33:22] (PS7) Mepps: Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305
[19:34:07] (CR) Mepps: "This is still WIP" [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305 (owner: Mepps)
[19:34:16] (CR) jerkins-bot: [V: -1] Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305 (owner: Mepps)
[19:34:24] jgleeson_: ah, actually the textfile collector is sort of the same thing as the node exporter
[19:34:32] they both write .prom files to the web root
[19:34:55] ah I see
[19:34:59] the node exporter is a bunch of default metrics
[19:35:03] so it doesn't serve the metrics
[19:35:16] it just generates them from other sources?
[19:35:21] we use the text file collector to expose arbitrary stuff
[19:35:31] yeah, you can think of the node exporter as "about this computer"
[19:35:48] ok, that makes more sense
[19:35:58] and then it has some golang webserver that runs out of the spool dir
[19:36:19] so the load balancers have a list of hosts and ports to ask for .prom data
[19:36:35] they in turn deliver it to grafana for pretty pictures
[19:37:12] (PS8) Mepps: Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305
[19:37:15] for a good time you can tunnel to :9090 on pay-lvs* and see the prometheus interface
[19:38:24] (PS9) Mepps: Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305
[19:38:31] do you think it's possible to create a test environment locally to simulate production?
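On the local-simulation question: a rough approximation of the production scrape path (minus frack networking) would be a local Prometheus scraping a local node_exporter whose textfile collector reads a copy of the .prom files. A minimal sketch, assuming default ports and recent flag names; the spool path is a placeholder:

```yaml
# prometheus.yml: run with ./prometheus --config.file=prometheus.yml,
# then browse the UI on :9090, as mentioned above for pay-lvs*
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']  # node_exporter's default port
```

Then start the exporter with something like `./node_exporter --collector.textfile.directory=/tmp/prom` (flag spelling varies by node_exporter version; older releases use single-dash flags).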
[19:38:44] so I can see where it's falling over specifically
[19:39:17] jgleeson_: possible, sure, but a lot of work :)
[19:39:33] Jeff_Green and i use virtualbox locally
[19:39:35] (CR) jerkins-bot: [V: -1] Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305 (owner: Mepps)
[19:39:54] I'm running the mediawiki vm
[19:40:04] but for the prometheus stuff I just set it up locally
[19:40:05] in order to simulate frack in any useful way you need pretty comprehensive networking support
[19:40:29] (PS10) Mepps: Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305
[19:40:52] but we can narrow this problem down
[19:41:08] I think the only thing I can do at this point is patch the known bug and release it again, until I can work out a way to do a production-like test
[19:41:11] i'm pretty sure it's in the scraping of the data by the load balancer, because the offending file was on disk
[19:41:17] with the duplicate keys
[19:41:20] (CR) jerkins-bot: [V: -1] Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305 (owner: Mepps)
[19:41:59] I'll hold off until next week now, don't want anything to break over the weekend
[19:42:05] or heck, it could even be at grafana
[19:43:40] cwd I saw a bug on the textfile exporter that looked relevant
[19:44:09] ah, no, red herring
[19:46:08] https://github.com/prometheus/client_java/issues/273
[19:46:11] maybe?
[19:48:19] ejegg: oh, i think jgleeson_ figured out what it was
[19:48:33] just interested to see where exactly it was falling over now
[19:49:13] jgleeson_: i sort of misspoke about node exporter, the web server is contained within that process
[19:49:25] it was ejegg who spotted the duplicates; I was then trying to track down where it was falling over, as I can't reproduce the failure locally
[19:49:32] ah ok cwd
[19:49:54] argh, and it does not write prom files
[19:50:02] but it does export a bunch of data, not as a prom file
[19:50:05] jgleeson_: so since prometheus is running on your local machine it can read the files directly?
[19:50:16] no
[19:50:18] And reading directly has no problem with the duplicates?
[19:50:22] I have to serve them over http
[19:50:24] so it could be the exporter process choking on the bad file
[19:50:33] so I have been using a simple php server
[19:50:37] outputting the content
[19:50:54] which was working fine from a frontend perspective
[19:51:12] but the actual prometheus binary stdout is showing an error
[19:51:14] ah, cool, and the php server sends out the files exactly as written, with no filtering?
[19:51:23] yep, exactly as-is
[19:51:30] file_get_contents()
[19:51:46] I was just sending you an email write-up
[19:51:56] I'll cc cwd in
[19:52:05] groovy
[19:52:11] :)
[20:10:26] ejegg I'm doing phab cleanup and came across this one https://phabricator.wikimedia.org/T176295 -- do you think these queues are worth reporting, given the other queue reporting we have now?
[20:11:46] oh, interesting
[20:12:02] i'm not sure where those are added to a queue
[20:12:15] but they seem like good numbers to have
[20:13:19] i'm not either, but I think I can just recycle the query nagios is using to drop a prom file
[20:13:58] fr-tech I am signing off for the weekend. Have a great weekend and see you Monday o/
[20:14:10] have a good weekend jgleeson_!
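The "simple php server" described above, serving the .prom files verbatim via file_get_contents(), might look something like this sketch; the spool path and filename are assumptions:

```php
<?php
// metrics.php: serve textfile-collector-style .prom files unmodified,
// e.g. with `php -S localhost:8000 metrics.php`. Nothing is filtered,
// so duplicate keys in a file reach Prometheus as-is, which is how
// the duplicate-key warning surfaced on the prometheus binary's stdout.
header('Content-Type: text/plain; version=0.0.4'); // Prometheus text format
foreach (glob('/tmp/prom/*.prom') as $promFile) {
    echo file_get_contents($promFile);
}
```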
[20:41:37] Fundraising-Backlog, Support-and-Safety: Revoke centralnotice-admin for Awight (WMF) - https://phabricator.wikimedia.org/T168428#3841667 (jrbs)
[20:44:24] Fundraising-Backlog, Epic: [Epic] Revoke AWight fundraising privileges - https://phabricator.wikimedia.org/T168421#3841690 (jrbs)
[20:44:26] Fundraising-Backlog, Support-and-Safety: Revoke centralnotice-admin for Awight (WMF) - https://phabricator.wikimedia.org/T168428#3841688 (jrbs) Open>Resolved Rights removed: ``` 2017-12-15T20:43:03 JSutherland (WMF) (talk | contribs | block) changed group membership for Awight (WMF) from central...
[20:55:16] Fundraising-Backlog, fundraising-tech-ops: add primary keys or unique indexes to some tables in civicrm, drupal, and pgehres databases - https://phabricator.wikimedia.org/T176631#3841697 (jcrespo) @Jgreen My recommendation, in preparation **for next year**, would be to try to research this and ROW migrat...
[21:04:03] hey ejegg my patch is working now, but ci is rejecting it because of the word 'catch'
[21:04:26] mepps oh, we should be able to suppress that
[21:04:36] lemme see, I think we do that for 'delete' someplace
[21:05:29] mepps see server.js line 108
[21:05:39] /*jslint -W024*/
[21:06:08] tells jslint to stop worrying about re-used keywords
[21:06:45] ohh i get it
[21:08:00] (PS11) Mepps: Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305
[21:09:00] (CR) jerkins-bot: [V: -1] Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305 (owner: Mepps)
[21:10:36] dang, do we need a different code
[21:10:37] ?
[21:11:01] ah, no, just more style pickiness
[21:11:14] (PS12) Mepps: Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305
[21:12:10] (CR) jerkins-bot: [V: -1] Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305 (owner: Mepps)
[21:13:28] (PS13) Mepps: Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305
[21:14:23] (CR) jerkins-bot: [V: -1] Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305 (owner: Mepps)
[21:16:07] (PS14) Mepps: Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305
[21:17:38] (PS1) Pcoombe: Fix select arrow overlapping contents [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/398550 (https://phabricator.wikimedia.org/T181435)
[21:39:00] oh dang, I can't test self-signed https stuff in chrom(e|ium) any more
[21:39:10] this is sure to be fun...
[21:46:43] also wtf is caching the old css?
[21:49:20] ah, silly script to swap deploy / dev extensions seems to have brok
[21:49:23] e
[21:50:35] (CR) Ejegg: [C: 2] "Looks great. Thanks Peter!" [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/398550 (https://phabricator.wikimedia.org/T181435) (owner: Pcoombe)
[21:52:51] (CR) Ejegg: [C: 2] "Code looks good, working locally" [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305 (owner: Mepps)
[21:52:54] (Merged) jenkins-bot: Fix select arrow overlapping contents [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/398550 (https://phabricator.wikimedia.org/T181435) (owner: Pcoombe)
[21:54:13] (Merged) jenkins-bot: Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305 (owner: Mepps)
[21:59:09] ok fr-tech, I'm heading out. Have a great weekend!