[02:23:03] Wikimedia-Fundraising, Mobile-Content-Service, Wikipedia-Android-App-Backlog, Wikipedia-iOS-App-Backlog, and 2 others: Run Big English fundraising on apps - https://phabricator.wikimedia.org/T181004#3839355 (bearND)
[02:23:27] Wikimedia-Fundraising, Mobile-Content-Service, Wikipedia-Android-App-Backlog, Wikipedia-iOS-App-Backlog, and 2 others: Run Big English fundraising on apps - https://phabricator.wikimedia.org/T181004#3776010 (bearND) Done, see T182802.
[15:22:42] fr-tech I want to use these updates to log non-200 responses from the Amazon SDK: https://github.com/ejegg/login-and-pay-with-amazon-sdk-php/tree/debugpauses
[15:22:49] Anybody see any problems?
[15:23:03] for background: we're already using a fork of the SDK
[15:23:16] that adds a reporting client
[15:23:42] since most of the underlying post / decode logic is the same between payments and reporting calls
[15:23:42] that link takes me to the readme
[15:23:50] ejegg
[15:24:04] jgleeson: ah, it's the last 2 commits on that branch
[15:24:16] https://github.com/ejegg/login-and-pay-with-amazon-sdk-php/commits/debugpauses
[15:28:23] looks good to me. I read the conditional in invokePost as "if not 200, 500, or 503" then log
[15:30:40] although I now see that the 500/503 code path triggers a perpetual retry which eventually results in a 200 or non-200 result
[15:34:05] jgleeson: wait, really?
[15:34:48] ah, no, pauseOnRetry actually throws an exception when we get past MAX_RETRYS
[15:35:19] ejegg i see a whitespace issue in BaseClient.php line 50, though not as big a deal
[15:35:33] so many whitespace issues...
[15:47:13] ejegg, sorry, yes, it does max out within pauseOnRetry
[15:55:10] ok, I'll merge those to my fork's dev-master and pull that update into DonationInterface
[16:04:59] mepps I keep trying to get WS fixes merged upstream as a precursor to upstreaming the ReportClient stuff
[16:05:02] https://github.com/amzn/amazon-pay-sdk-php/pull/63/files
[16:05:16] but they're super-unresponsive
[16:05:16] (PS1) Jgleeson: Updated composer package stats-collector to version 1.0.0 [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/398495
[16:07:44] oh hey, dude actually commented on the PR that time!
[16:08:16] * ejegg crosses fingers
[16:08:32] ejegg nice, also see that you introduced one too in line 50 though
[16:09:13] ah, yeah, I'll have to completely redo the reportClient stuff for 3.2.0 before I try upstreaming it
[16:09:57] just waiting on the whitespace standardization first, so I can submit the rest as intelligible chunks
[16:11:26] the logger stuff is actually already implemented upstream
[16:11:59] (though annoyingly, they've just copied Psr\Log right into their lib instead of listing it as a dependency)
[16:15:11] oh weird
[16:21:29] hi AndyRussG :)
[16:23:23] jgleeson: morning! :)
[16:25:03] (PS1) Ejegg: Update Amazon SDK fork for logging retries [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/398500 (https://phabricator.wikimedia.org/T182735)
[16:25:08] just logging the Amazon, don't be alarmed ^^^
[16:25:29] i am having the weirdest time with dash, ejegg: for some reason it's not registering my code changes
[16:26:26] mepps are you running it as an actual service, or just using 'node server.js -d' from the command line each time?
[16:26:47] node server.js, but i restart it when i make code changes
[16:27:10] ooh i finally got the error i was looking for!
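The exchange above describes the fork's retry behavior: invokePost logs non-200 responses, while 500/503 responses go through pauseOnRetry, which backs off and finally throws once MAX_RETRYS is exceeded. A minimal PHP sketch of that flow, assuming only the names invokePost, pauseOnRetry, MAX_RETRYS, and BaseClient from the chat; the httpPost helper, logging call, and backoff timings are hypothetical:

```php
<?php
// Sketch only: the real SDK fork may differ in signatures and delays.
class BaseClient
{
    const MAX_RETRYS = 3;

    private function invokePost($url, array $params)
    {
        $retries = 0;
        while (true) {
            list($status, $body) = $this->httpPost($url, $params);
            if ($status === 200) {
                return $body;
            }
            if ($status !== 500 && $status !== 503) {
                // the debugpauses branch adds logging for these responses;
                // error_log stands in for the fork's real logger
                error_log("Amazon returned HTTP $status: " . substr((string) $body, 0, 200));
                return $body;
            }
            // 500/503: pause and retry, bounded by MAX_RETRYS
            $this->pauseOnRetry(++$retries, $status);
        }
    }

    private function pauseOnRetry($retries, $status)
    {
        if ($retries > self::MAX_RETRYS) {
            // this is the exception ejegg points out at 15:34:48
            throw new \Exception("Reached maximum retries for HTTP $status");
        }
        // exponential backoff: 0.1s, 0.2s, 0.4s
        usleep((int) (pow(2, $retries) * 50000));
    }

    // hypothetical transport helper, included so the sketch is self-contained
    private function httpPost($url, array $params)
    {
        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_POST => true,
            CURLOPT_POSTFIELDS => http_build_query($params),
        ));
        $body = curl_exec($ch);
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
        return array($status, $body);
    }
}
```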
phew
[16:39:31] okay ejegg i just had to find where data.js was used
[17:04:51] fr-tech sorry, net issues
[17:09:38] fr-tech, i caught the internet flu too now!
[17:09:40] and you're not even in the country that just sold out to telecoms
[17:09:50] oops, meaning ejegg not mepps :)
[17:27:00] fundraising-tech-ops, monitoring, Epic: overhaul fundraising cluster monitoring - https://phabricator.wikimedia.org/T91508#3841020 (cwdent)
[18:15:31] !log rolled back civicrm to 798e2467 to investigate prometheus bug
[18:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:30:06] Fundraising-Backlog, fundraising-tech-ops: add primary keys or unique indexes to some tables in civicrm, drupal, and pgehres databases - https://phabricator.wikimedia.org/T176631#3841274 (Jgreen) p:Triage>Low Clearly this was not the cause of the replication lag issues that originally prompted me...
[18:32:41] Fundraising-Backlog, fundraising-tech-ops, Operations: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#3841291 (Jgreen)
[18:32:43] Fundraising-Backlog, fundraising-tech-ops, Operations, Spike: Spike: Enumerate remaining unported stats - https://phabricator.wikimedia.org/T175850#3841288 (Jgreen) Open>Resolved a:Jgreen We have fundraising grafana dashboards now that cover the stuff we care about.
[18:37:22] ?
[18:38:10] [redacted]
[19:14:29] mepps, jgleeson_ hi!
[19:14:45] hi cwd!
[19:15:01] hi cwd :)
[19:16:05] how goes? we're seeing some interesting prometheus behavior that might be php related
[19:16:34] jgleeson you said you were looking at something related to this?
[19:16:44] yep
[19:17:11] https://grafana.wikimedia.org/dashboard/db/fundraising-host-overview?panelId=21&fullscreen&orgId=1&var-server=civi1001.frack.eqiad.wmnet&var-datasource=frack.codfw%20prometheus&from=now-6h&to=now&refresh=5m
[19:17:36] looks like some stuff has not been reporting lately... haven't nailed down what
[19:17:37] about an hour ago I rolled back a release from last night that was creating erroneous duplicate metric data
[19:17:41] cwd
[19:17:55] ah ha, would that have been the donations.prom file?
[19:17:58] I think it was to do with that
[19:18:00] yes
[19:18:04] the new addition
[19:18:23] great, we blew the dir away and it came back to life, so we were wondering if it was something like that
[19:18:27] thanks and carry on!
[19:18:45] we added a bunch of new averages and overalls, but the individual stats were also still output
[19:19:16] annoyingly it doesn't fail when prometheus scrapes that data directly as a target
[19:19:31] so I am guessing the node exporter is failing less gracefully
[19:19:51] which might explain the lack of data entirely for the period
[19:21:17] my local testing consists of a local install of Prometheus scraping the output text files served by a local PHP server. It seemed to work using that, but as it's not a true reflection of production, I suspect the node exporter gap is where it's failing
[19:22:23] huh, i was assuming it was the server scrape that failed
[19:22:27] cause i saw the file on disk
[19:22:40] Jeff_Green noticed it had some repeating keys
[19:24:19] so the local scrape still works for me
[19:24:38] but there is a warning in the console log indicating the duplicate key
[19:25:26] as in, new data is scraped correctly for non-duplicated keys
[19:25:29] what console is that? i am trying to find an error message
[19:25:34] like stdout?
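For reference, this is a hypothetical illustration (not the real donations.prom) of the duplicate-key problem described above, in the Prometheus text exposition format. A series is the metric name plus its label set, and each series may appear only once per scrape:

```
# Hypothetical donations.prom contents; names and values are made up.
donations_total{gateway="adyen"} 1012
donations_rate_avg 3.4
# the next line repeats an existing series (same name and labels),
# which is what adding overalls alongside the old individual stats
# can produce, and what the collector complains about:
donations_total{gateway="adyen"} 1012
```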
[19:25:39] yeah
[19:25:46] running ./prometheus
[19:25:49] with a config.file
[19:25:58] hrm, bet cron eats it
[19:26:41] is the node exporter consuming the .prom files and serving them itself?
[19:26:53] or wait, it is its own cron
[19:28:06] jgleeson_: in prod the load balancers are running the service that scrapes the servers
[19:28:10] gets those .prom files over http
[19:28:47] grafana.wm.o polls pay-lvs for the data to graph
[19:32:00] it looks like the prometheus node exporter uses a textfile collector to suck up the .prom files and make them available at the instance's /metrics url
[19:33:10] I'm guessing it's failing somewhere around there when a duplicate key is encountered
[19:33:22] (PS7) Mepps: Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305
[19:34:07] (CR) Mepps: "This is still WIP" [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305 (owner: Mepps)
[19:34:16] (CR) jerkins-bot: [V: -1] Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305 (owner: Mepps)
[19:34:24] jgleeson_: ah, actually the textfile collector is sort of the same thing as the node exporter
[19:34:32] they both write .prom files to the web root
[19:34:55] ah I see
[19:34:59] the node exporter is a bunch of default metrics
[19:35:03] so it doesn't serve the metrics
[19:35:16] it just generates them from other sources?
[19:35:21] we use the text file collector to expose arbitrary stuff
[19:35:31] yeah, you can think of the node exporter as "about this computer"
[19:35:48] ok, that makes more sense
[19:35:58] and then it has some golang webserver that runs out of the spool dir
[19:36:19] so the load balancers have a list of hosts and ports to ask for .prom data
[19:36:35] they in turn deliver it to grafana for pretty pictures
[19:37:12] (PS8) Mepps: Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305
[19:37:15] for a good time you can tunnel to :9090 on pay-lvs* and see the prometheus interface
[19:38:24] (PS9) Mepps: Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305
[19:38:31] do you think it's possible to create a test environment locally to simulate production?
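On the local-simulation question: a rough approximation of the production scrape path (minus frack networking) would be a local Prometheus scraping a local node_exporter whose textfile collector reads a copy of the .prom files. A minimal sketch, assuming default ports and recent flag names; the spool path is a placeholder:

```yaml
# prometheus.yml: run with ./prometheus --config.file=prometheus.yml,
# then browse the UI on :9090, as mentioned above for pay-lvs*
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']  # node_exporter's default port
```

Then start the exporter with something like `./node_exporter --collector.textfile.directory=/tmp/prom` (flag spelling varies by node_exporter version; older releases use single-dash flags).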
[19:38:44] so I can see where it's falling over specifically
[19:39:17] jgleeson_: possible, sure, but a lot of work :)
[19:39:33] Jeff_Green and i use virtualbox locally
[19:39:35] (CR) jerkins-bot: [V: -1] Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305 (owner: Mepps)
[19:39:54] I'm running the mediawiki vm
[19:40:04] but for the prometheus stuff I just set it up locally
[19:40:05] in order to simulate frack in any useful way you need pretty comprehensive networking support
[19:40:29] (PS10) Mepps: Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305
[19:40:52] but we can narrow this problem down
[19:41:08] I think the only thing I can do at this point is patch the known bug and release it again, until I can work out a way to do a production-like test
[19:41:11] i'm pretty sure it's in the scraping of the data by the load balancer, because the offending file was on disk
[19:41:17] with the duplicate keys
[19:41:20] (CR) jerkins-bot: [V: -1] Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305 (owner: Mepps)
[19:41:59] I'll hold off until next week now, don't want anything to break over the weekend
[19:42:05] or heck, it could even be at grafana
[19:43:40] cwd I saw a bug on the textfile exporter that looked relevant
[19:44:09] ah, no, red herring
[19:46:08] https://github.com/prometheus/client_java/issues/273
[19:46:11] maybe?
[19:48:19] ejegg: oh, i think jgleeson_ figured out what it was
[19:48:33] just interested to see where exactly it was falling over now
[19:49:13] jgleeson_: i sort of misspoke about node exporter, the web server is contained within that process
[19:49:25] it was ejegg who spotted the duplicates; I was then trying to track down where it was falling over, as I can't reproduce the failure locally
[19:49:32] ah ok cwd
[19:49:54] argh, and it does not write prom files
[19:50:02] but it does export a bunch of data, not as a prom file
[19:50:05] jgleeson_: so since prometheus is running on your local machine it can read the files directly?
[19:50:16] no
[19:50:18] And reading directly has no problem with the duplicates?
[19:50:22] I have to serve them over http
[19:50:24] so it could be the exporter process choking on the bad file
[19:50:33] so I have been using a simple php server
[19:50:37] outputting the content
[19:50:54] which was working fine from a frontend perspective
[19:51:12] but the actual prometheus binary stdout is showing an error
[19:51:14] ah, cool, and the php server sends out the files exactly as written, with no filtering?
[19:51:23] yep, exactly as-is
[19:51:30] file_get_contents()
[19:51:46] I was just sending you an email write-up
[19:51:56] I'll cc cwd in
[19:52:05] groovy
[19:52:11] :)
[20:10:26] ejegg I'm doing phab cleanup and came across this one https://phabricator.wikimedia.org/T176295 -- do you think these queues are worth reporting, given the other queue reporting we have now?
[20:11:46] oh, interesting
[20:12:02] i'm not sure where those are added to a queue
[20:12:15] but they seem like good numbers to have
[20:13:19] i'm not either, but I think I can just recycle the query nagios is using to drop a prom file
[20:13:58] fr-tech I am signing off for the weekend. Have a great weekend and see you Monday o/
[20:14:10] have a good weekend jgleeson_!
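The "simple php server" described above, serving the .prom files verbatim via file_get_contents(), might look something like this sketch; the spool path and filename are assumptions:

```php
<?php
// metrics.php: serve textfile-collector-style .prom files unmodified,
// e.g. with `php -S localhost:8000 metrics.php`. Nothing is filtered,
// so duplicate keys in a file reach Prometheus as-is, which is how
// the duplicate-key warning surfaced on the prometheus binary's stdout.
header('Content-Type: text/plain; version=0.0.4'); // Prometheus text format
foreach (glob('/tmp/prom/*.prom') as $promFile) {
    echo file_get_contents($promFile);
}
```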
[20:41:37] Fundraising-Backlog, Support-and-Safety: Revoke centralnotice-admin for Awight (WMF) - https://phabricator.wikimedia.org/T168428#3841667 (jrbs)
[20:44:24] Fundraising-Backlog, Epic: [Epic] Revoke AWight fundraising privileges - https://phabricator.wikimedia.org/T168421#3841690 (jrbs)
[20:44:26] Fundraising-Backlog, Support-and-Safety: Revoke centralnotice-admin for Awight (WMF) - https://phabricator.wikimedia.org/T168428#3841688 (jrbs) Open>Resolved Rights removed: ``` 2017-12-15T20:43:03 JSutherland (WMF) (talk | contribs | block) changed group membership for Awight (WMF) from central...
[20:55:16] Fundraising-Backlog, fundraising-tech-ops: add primary keys or unique indexes to some tables in civicrm, drupal, and pgehres databases - https://phabricator.wikimedia.org/T176631#3841697 (jcrespo) @Jgreen My recommendation, in preparation **for next year**, would be to try to research this and ROW migrat...
[21:04:03] hey ejegg my patch is working now, but ci is rejecting it because of the word 'catch'
[21:04:26] mepps oh, we should be able to suppress that
[21:04:36] lemme see, I think we do that for 'delete' someplace
[21:05:29] mepps see server.js line 108
[21:05:39] /*jslint -W024*/
[21:06:08] tells jslint to stop worrying about re-used keywords
[21:06:45] ohh i get it
[21:08:00] (PS11) Mepps: Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305
[21:09:00] (CR) jerkins-bot: [V: -1] Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305 (owner: Mepps)
[21:10:36] dang, do we need a different code
[21:10:37] ?
[21:11:01] ah, no, just more style pickiness
[21:11:14] (PS12) Mepps: Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305
[21:12:10] (CR) jerkins-bot: [V: -1] Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305 (owner: Mepps)
[21:13:28] (PS13) Mepps: Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305
[21:14:23] (CR) jerkins-bot: [V: -1] Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305 (owner: Mepps)
[21:16:07] (PS14) Mepps: Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305
[21:17:38] (PS1) Pcoombe: Fix select arrow overlapping contents [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/398550 (https://phabricator.wikimedia.org/T181435)
[21:39:00] oh dang, I can't test self-signed https stuff in chrom(e|ium) any more
[21:39:10] this is sure to be fun...
[21:46:43] also wtf is caching the old css?
[21:49:20] ah, silly script to swap deploy / dev extensions seems to have brok
[21:49:23] e
[21:50:35] (CR) Ejegg: [C: 2] "Looks great. Thanks Peter!" [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/398550 (https://phabricator.wikimedia.org/T181435) (owner: Pcoombe)
[21:52:51] (CR) Ejegg: [C: 2] "Code looks good, working locally" [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305 (owner: Mepps)
[21:52:54] (Merged) jenkins-bot: Fix select arrow overlapping contents [extensions/DonationInterface] - https://gerrit.wikimedia.org/r/398550 (https://phabricator.wikimedia.org/T181435) (owner: Pcoombe)
[21:54:13] (Merged) jenkins-bot: Use pool in data.js, don't close pool [wikimedia/fundraising/dash] - https://gerrit.wikimedia.org/r/398305 (owner: Mepps)
[21:59:09] ok fr-tech, I'm heading out. Have a great weekend!