[03:24:34] eileen: hi! ahhh back at the keyboard for a little while now this evening... how's it going? do you know where the e-mail pref ctr got to? [03:32:18] no I don't - it seemed a hive of activity [03:36:19] yeah [03:37:30] hmmm when I load a page it just seems to hang [03:37:50] I assume some sort of connection issue with CiviProxy then [03:38:10] yeah [03:39:37] hmmm [03:49:22] I think there was an issue with firewalls & redis & stuff like that [03:49:50] but I think it's in Dallas/Jeff's court [03:51:37] yeah [03:51:53] there definitely were such issues, but I don't know where they ended up at the end of the day [04:22:27] (PS6) AndyRussG: Add editable Name pair search using search kit [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/681505 (https://phabricator.wikimedia.org/T244404) (owner: Eileen) [04:24:11] eileen: is it normal that every time I switch branches on the crm repo I have to reset composer.lock? [04:26:20] hmm - that sounds odd - between deployment & master or your own locally? [04:30:38] eileen: my own locally, like from one review branch to another or back to master [04:43:02] AndyRussG: hmm that's odd - maybe do a git pull --rebase to make sure you are running over the latest? [05:05:20] hmmmm k I'll try that next time [05:05:47] cya! (heading to sleep early since one kid has zoom class in less than 8 hours...) [05:11:17] night [05:16:32] (PS3) Eileen: Catch and handle exceptions in Monolog [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/684664 (https://phabricator.wikimedia.org/T281647) [05:16:34] (PS1) Eileen: Integrity fixes to the search & lock file [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/685637 [13:22:34] Fundraising Sprint Princess Mongodb, Fundraising-Backlog, fundraising-tech-ops, Fr-planning-ahead, and 2 others: Set up and configure donor_prefs redis instance on FR queue role - https://phabricator.wikimedia.org/T281955 (ayounsi) [15:11:02] howdy fr-techers [15:24:18] Fundraising-Backlog, Wikimedia-Fundraising-CiviCRM, FR-AutoTY-Email: Civi stock TY email template missing gift date - https://phabricator.wikimedia.org/T281313 (DStrine) [15:24:18] hi jgleeson! i'm not quite really online yet [15:24:41] but i think the e-p-c site might just be able to connect to civiproxy now [15:24:49] Jeff_Green: dwisehaupt any updates on the network config & redis instance for the EPC work? [15:24:56] ah cool! [15:25:03] worth testing! [15:25:07] I'll try ejegg|away thanks for the heads up [15:25:22] we seem to have worked out all the localettings kinks yestersay [15:25:35] jgleeson: redis is up, network has one snag to resolv [15:25:42] jgleeson: just fixed about 2 minutes ago. [15:25:57] awesomes [15:26:06] was just working with XioNoX on it. [15:27:40] hi jgleeson dwisehaupt jgleeson ejegg|away fr-tech [15:28:05] howdy! [15:28:25] :) [15:28:26] fr-tech I'm still hitting the runtime exception (expected maybe?) https://fundraising.frdev.wikimedia.org/index.php/Special:EmailPreferences/emailPreferences?contact_id=13&contact_hash=111 [15:28:38] yeah same here [15:28:45] I guess we've got some loggin' we can check now though! [15:28:59] just gotta find my yubikey [15:29:11] yep yep you should now be able to see those im the frdata log [15:29:18] also for a follow-on we should handle errors a weee bit more gracefully [15:29:44] AndyRussG: maybe we should catch those errors and render the nicer error page? [15:29:47] ejegg|away: is that on frdata itself btw? yesterday I tried to log in there, but I was not permitted such an experience [15:29:56] ejegg|away: yep eyp [15:30:00] *eyp [15:30:02] *yep [15:30:43] aaargh if you make the same typo twice, should you correct yourself twice? not sure now [15:31:43] lol I'd be correcting for a while if we adopted that policy AndyRussG :) [15:32:33] hehe I might get stuck in an infinite error-and-self-correction loop [15:32:38] I'm impressed at the amount of times my fingers can repeatedly misspell a word I'm trying to correct at pace [15:32:41] lol [15:33:34] heh what I find impressive is how quickly my kids can type on those touch-screen phone and tablet keyboards [15:34:00] and how spelling mistakes abound and they barely even notice them [15:34:32] do you use the swipe touchscreen keyboard thingy AndyRussG ? [15:34:40] making random spelling mistakes and typos seems like just part of the language [15:35:19] yeah I do jgleeson though sometimes it's annoying because no matter how hard I try there are some words that it just doesn't get [15:35:29] ah yeah that is a thing [15:35:41] and so then after having tried to swipe the word five times I give in and actually type it [15:35:50] yep.. been there [15:36:06] it looks funny watching others use it from side-on [15:54:16] so it looks like we might wanna look at the civicrm logs [15:54:16] ie, we could test from frdata across the full apache/nginx chain, but also on the civi host directly against the civiproxy port. [15:54:21] that's sending back a 500 error [15:54:24] i see this too PHP Fatal error: require_once(): Failed opening required 'config.php' (include_path='.:/usr/share/php') in /srv/civiproxy/proxy/rest.php on line 10 [15:54:34] oh rly! [15:54:36] good spot [15:54:49] oooh noice [15:55:08] cstone: is that locally or in the live logs [15:55:10] yeah I was going to say, looking at which line the error is coming from, it looks like maybe not a network thing [15:55:13] live [15:55:21] when i tried to hit my contact id and hash [15:55:51] ah no I take that back, it could still be [15:55:55] interesting [15:56:11] https://etherpad.wikimedia.org/p/emailpreferrors [15:56:12] so I'm seeing this, which suggest civiproxy is trying to hit civicrm [15:56:17] heres the whole chunk ^ [15:56:18] May 6 15:27:50 frdata1002 CiviproxyConnector: contact id: 13, Server error: `GET https://civi1001.frack.eqiad.wmnet:442/rest.php?entity=civiproxy&action=getpreferences& [15:56:19] key=SITE_KEY&api_key=API_KEY&version=3&json=1&hash=111&contact_id=13` resulted in a `500 Internal Server Error` response [15:56:52] yeah i see that too jgleeson [15:57:42] ok let's dig into the config one first [15:57:58] civiproxy is on civi1001 right [15:58:06] under /srv/civiproxy [15:58:14] so let's see what config.php looks like there [15:58:31] correct. [15:58:53] ok first issue [15:59:48] I'm guessing we set the deployment job to move localsettings/civiproxy/config.php to /srv/civiproxy/config.php but instead it needs to go to /srv/civiproxy/proxy/config.php [15:59:54] dwisehaupt: Jeff_Green can we update that? [16:00:16] civiproxy's project root isn't actually the root of the bit we use [16:00:39] /srv/civiproxy/proxy is the root where we need to drop config.php. easy to see how that's happened [16:01:09] I wonder if I can temporarily just move it to see if it fixes things [16:01:42] sure. that's doable. [16:01:51] i'll move it by hand real quick to verify. [16:02:03] thanks! [16:02:11] I don't have perms which I kinda expected [16:02:24] done [16:03:10] let's give that a whirl [16:03:26] ok still erroring but let's check the logs [16:05:31] ok I now see [16:05:33] May 6 16:03:13 frdata1002 CiviproxyConnector: contact id: 13, Client error: `GET https://civi1001.frack.eqiad.wmnet:442/rest.php?entity=civiproxy&action=getpreferences& [16:05:35] key=SITE_KEY&api_key=API_KEY&version=3&json=1&hash=111&contact_id=13` resulted in a `400 This CiviProxy installation requires SSL encryption. (CiviProxy 0.6-dev)` respon [16:05:37] se: CiviProxy Error https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/DonationInterface/+/refs/heads/master/extras/civiproxy/CiviproxyConnect.php#18 [17:03:52] ok for a quick test, should we try setting that to 'true' and pushing it out [17:04:00] to see if that causes more specific errors [17:04:05] hmmm [17:04:07] AndyRussG: ? [17:04:15] so we should also be logging the specific error here: $logger->error( "contact id: $contact_id, " . $e->getMessage() ); [17:04:41] (line 61) [17:04:46] i think we do get an error, it's the response from civiproxy telling us you're not running me over ssl [17:04:47] are we not getting that? [17:05:09] but I'm wondering if once we set verify to true, it would bomb out earlier due to the suspect non-ssl transport [17:05:09] hmmm ok just for reference, where is that specific info? [17:05:39] jgleeson: I think the verify => false just makes it accept self-signed certs [17:05:55] AndyRussG: it's in the logs on frlog1001 under /var/log/remote/frdata. I added the specific error to the bottom of cstone's etherpad here https://etherpad.wikimedia.org/p/emailpreferrors [17:05:58] if the protocol says https I doubt it's tyring anthing else [17:06:48] I think if guzzle is using curl under the hood, and it sets verify host&peer to false, it will resolve https:// urls even if they are not ssl [17:06:57] that's a guess [17:07:10] jgleeson: ejegg: i have reenabled the fr-tech accounts on the frdata role. [17:07:23] thanks dwisehaupt [17:07:32] ok, so now we can make those curl calls to see what the response it [17:07:33] i have also fixed and pushed the change to distribute the config.php file to the proxy subdir [17:07:34] ok thanks dwisehaupt. we should be able to check ourslves [17:07:35] *is [17:07:42] yup yup [17:08:12] hmm [17:08:15] channel 0: open failed: connect failed: Connection refused [17:08:42] do we go in via the usual frbast.wikimedia.org dwisehaupt ? [17:09:07] yeah. usual route. [17:09:12] jgleeson: oohhhh [17:09:29] it doesn't like me [17:09:36] what do you want to bet the is_secure check is just trying to see if it's port 443? [17:09:47] or maybe nginx needs to forward an additional header? [17:09:50] omg [17:09:53] maybe!! [17:09:59] we can check the code to confirm [17:10:03] dwisehaupt: civiproxy is served by apache, right? [17:10:14] ejegg: correct [17:10:15] and nginx does an ssl termination on 442 and then forwards to apache? [17:10:20] correct [17:10:29] ok, let's see what the code checks [17:10:35] if ($_SERVER['HTTPS'] != "on") { [17:10:37] civiproxy_http_error("This CiviProxy installation requires SSL encryption.", 400); [17:10:39] } [17:10:44] oh ok [17:11:02] so... can we tell apache to set that? [17:11:06] 'HTTPS' [17:11:07] Set to a non-empty value if the script was queried through the HTTPS protocol. [17:11:20] from php.net [17:11:49] fr-tech anyone able to ssh to frdata1001? [17:12:28] it's frdata1002 [17:12:52] jgleeson: yeah, it is indeed complaining about not being HTTPS [17:13:01] so we just need to get that variable set [17:13:09] I'm sure I've seen this before [17:13:34] let me look at our configs and see if we've seen this before [17:13:43] the best thing would be for the code to understand X-Forwarded-Proto [17:13:48] I think mediawiki does that [17:14:11] so nginx would set that header to X-Forwarded-Proto='https' on its request to apache [17:14:31] but then the php code needs to understand that and set the $_SERVER variable somewhere early in the bootstrapping [17:14:45] which I'm guessing civiproxy doesn't do [17:14:54] but it would be a useful thing to upstream! [17:15:09] interesting that we don't hit this locally? [17:15:22] yeah. we definitely set that X-Forwarded-Proto header. [17:15:44] AndyRussG: locally we have apache handle the ssl termination [17:15:51] so it sets SERVER'HTTPS' to true [17:16:09] it's because of the nginx->apache http request that prod has the issue [17:16:16] ahhh right thx :) [17:16:58] (60) SSL certificate problem: unable to get local issuer certificate [17:17:00] More details here: https://curl.haxx.se/docs/sslcerts.html [17:17:21] oh wait [17:17:31] I'm trying to hit civicrm [17:17:59] actually no I'm not [17:18:29] fr-tech when I hit the URL in the etherpad I'm getting ssl errors [17:18:44] hitting it from civi1001 [17:20:17] * TCP_NODELAY set [17:20:19] * Expire in 200 ms for 4 (transfer 0x559f99526fb0) [17:20:21] * Connected to civi1001.frack.eqiad.wmnet (10.64.40.109) port 442 (#0) [17:20:23] * ALPN, offering h2 [17:20:25] * ALPN, offering http/1.1 [17:20:27] * successfully set certificate verify locations: [17:20:29] * CAfile: none [17:20:31] CApath: /etc/ssl/certs [17:20:33] * TLSv1.3 (OUT), TLS handshake, Client hello (1): [17:20:35] * TLSv1.3 (IN), TLS handshake, Server hello (2): [17:20:37] * TLSv1.2 (IN), TLS handshake, Certificate (11): [17:20:39] * TLSv1.2 (OUT), TLS alert, unknown CA (560): [17:20:41] * SSL certificate problem: unable to get local issuer certificate [17:20:43] * Closing connection 0 [17:20:55] this might be a red herring as this is from civi1001 and not frdata1001 (which I can't connect to atm) [17:21:08] yes. we are using the puppet cert. so you would need to specify the CA cert file [17:21:20] jgleeson: connect to frdata1002 instead of frdata1001 [17:21:49] ohh ok this might be the issue [17:22:05] are we specifying the ca cert somewhere on EPC/frdata? [17:22:43] SetEnvIf X-Forwarded-Proto "https" HTTPS=on [17:22:59] I think we can add ^^^ to the apache config [17:23:00] i don't know if it was needed. different software behaves differently and not everything requires the cert to be validated. [17:23:03] lemme try that locally [17:23:10] ejegg: but curl is reporting a CA problem also [17:23:14] maybe 2 issues? [17:23:24] it can see the cert but can't validate it [17:23:29] when I hit the URL [17:23:46] jgleeson wait, didn't AndyRussG say we're specifically not validating? [17:23:55] oh yes he did [17:24:38] jgleeson: you can test with validation on civi1001 by adding the following to your curl command: --cacert /etc/ssl/certs/fundraising_CA.pem [17:24:38] or at least that's how it went locally, yeah, setting that to false did prevent self-signed issues in the e-p-c call to civiproxi [17:24:40] https://phabricator.wikimedia.org/diffusion/EDOI/browse/master/extras/civiproxy/CiviproxyConnect.php$29 [17:24:49] hang on a sec [17:25:01] i just did wget --no-check-certificate from the frdata server [17:25:06] and it made the request fine [17:25:09] we were deliberately not verifying locally to avoid out self-signed cert problems in local dev [17:25:29] this is unrelated to prod [17:25:39] jgleeson: ahh, we're also not verifying in prod because it's a self-signed cert there too! [17:25:58] ah ok [17:26:01] got it [17:26:24] dwisehaupt: what about adding: SetEnvIf X-Forwarded-Proto "https" HTTPS=on [17:26:26] yea for better or for worse the verify=>false is hardcoded [17:26:32] to the apache config for civiproxy? [17:26:54] It looks like that env var is what ends up in SERVER['HTTPS'] [17:27:34] worth a shot. let me hand tweeze it and we can test. [17:27:47] and it sounds ok to use in this case. [17:28:02] dwisehaupt: added the cacert to the curl request quieted down the error [17:28:16] I now get the (expected?) 'Only authorized clients may access this server' [17:28:38] guessing apache is only allowing traffic from frdata? [17:29:04] nginx even [17:29:46] ok. set and apache reloaded. [17:29:50] feel free to test. [17:30:08] fr-tech this is a kinda an aside but I'm surprised I'm not seeing the log from this line https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/DonationInterface/+/refs/heads/master/extras/civiproxy/CiviproxyConnect.php#61 [17:30:49] fr-tech oooooh progress!!!! [17:30:51] EmailPreferences: Error from civiproxy: API permission check failed for Civiproxy/getpreferences call; insufficient permission: require administer CiviCRM [17:30:59] hmm AndyRussG was that the one i wasn't seeing locally unless i swapped the logic [17:31:00] ooh nice [17:31:17] cstone which logic did u swap? [17:31:25] hmm maybe it was a different spot cause yeah theres none haha [17:31:44] yeah sorry AndyRussG I was thinking a different spot [17:31:46] AndyRussG: I wonder if you're not seeing that due to no exception being thrown. maybe it's handling the error response gracefully? [17:32:09] ah ok thx cstone [17:32:09] ahh AndyRussG that's when the API key isn't pointing to the admin user [17:32:12] i'll look to codify that apache config change. [17:32:27] let's check the civiproxy config [17:33:01] jgleeson the only way for on_error to be set to true is for either civiproxi itself to set that flag (which would only have happened if the request were succeeding) or for the code to make it to that catch block [17:33:33] dwisehaupt: yay! we got further along, now the error is from civiproxi connecting to civi ^ what jgleeson mentioned :) [17:35:54] ok. i need to roll this conifg update otherwise puppet will clobber my hand alter. [17:36:31] sweet! [17:36:39] yeah I see what you mean AndyRussG. not sure why the [17:36:41] then* [17:37:08] AndyRussG: jgleeson: I think we don't want it to point to the admin user [17:37:18] I think we just want to change that api call to not require admin permission [17:37:29] we created a special user specifically to limit the power of that call [17:37:36] oh ok. we didn't get that specific on local [17:37:37] lemme see what the syntax is [17:37:45] makes sense [17:38:09] ok. rolled and puppet run completed. [17:38:36] thanks dwisehaupt [17:40:12] yeah fr-tech, the user api key in the config points to the user 'fr-tech@wikimedia.org' [17:40:28] guessing we need to give that account access [17:40:50] not sure how we set the role levels on api calls in civicrm 0_0 [17:41:19] is there an api RBACL maybe [17:42:01] hmm, maybe https://docs.civicrm.org/dev/en/latest/hooks/hook_civicrm_alterAPIPermissions/ [17:42:03] https://docs.civicrm.org/dev/en/latest/security/access/ [17:42:11] ah [17:42:12] jgleeson: we don't want to give that account admin access [17:42:26] we want to require fewer perms for getpreferences [17:42:47] yeah I wasn't thinking that. I was thinking of a way of allowing that account to access the resource/api call [17:43:01] that was the first google result for civicrm acl [17:43:49] looking at your link [17:44:15] ohhh interesting [17:44:17] If a given entity/action permissions are unset, the default "administer CiviCRM" permission is enforced. [17:45:59] looks like we need to update this file https://gerrit.wikimedia.org/r/c/wikimedia/fundraising/crm/+/649729/13/drupal/sites/default/civicrm/extensions/wmf-civicrm/api/v3/Civiproxy/Getpreferences.php [17:47:07] fr-tech do we want create a new role for the fr-tech user (user we're referring to with our API key) ? [17:47:34] PROBLEM - check_log_messages on frav1002 is CRITICAL: CRITICAL: GlobalCollect_endpoint_critical 2 [=1] [17:47:36] fr-tech not to distract but ingenico is creating some failmail although the UI eventually loads for me [17:47:44] or do we wanna make the getprefs get call available to all users who have the role 'view all contacts' [17:47:58] ideally the former for me [17:48:01] ah ok cstone [17:48:08] (PS1) Ejegg: Relax API perms for proxy getpreferences [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/685875 [17:48:13] jgleeson: I think it's ^^^ [17:48:16] and a test donation went through so users arent getting errors [17:48:17] Could not resolve host: world.api-ingenico.com [17:48:31] timeouts their end? [17:48:41] yeah looks like it [17:48:45] (PS2) Ejegg: Relax API perms for proxy getpreferences [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/685875 [17:49:06] jgleeson: I think the 'view all contacts' is fine [17:49:26] if someone can view all contacts via the UI, why not let them view a small subset of contact info via the API? [17:49:47] jgleeson: ooh, that sounds more like DNS errors [17:49:47] yeah I suppose [17:50:22] world.api-ingenico.com timing out for me [17:50:32] not sure if it's accessible to the www [17:50:33] ah damn [17:50:43] let's see if their other api server is working [17:51:07] I got a donation through so not everything seems to be broken [17:51:18] we got an icinga alert above but slightly misleading as it's named GlobalCollect [17:51:24] but yeah still a starting point [17:51:55] yeah. i'm seeing some (possibly unrelated) dns resolution issues with different services also. [17:52:29] locally I can lookup eu.api-ingenico.com and world.api-ingenico.com, but on payments1004 both are failing [17:52:34] PROBLEM - check_log_messages on frav1002 is CRITICAL: CRITICAL: GlobalCollect_endpoint_critical 1 [=1] [17:52:52] dwisehaupt: let's get netops on it [17:53:12] should we switch off ingenico via LocalSettings for now? [17:53:24] that might break any deeplinks [17:53:42] if we're running anything [17:53:58] south america campaign is live [17:54:09] and i think peru + uruguay still run via ingenico [17:54:17] okies [17:54:34] hmm hmm, but it's sometimes working, i guess, from what cstone sees? [17:54:46] just from a person donating i didnt see explosions [17:54:53] and got a thank you email [17:55:40] yah, I see some api calls working and others failing [17:56:15] same [17:56:19] oh okay this time I got a JS alert error do we have a lot of those? hah [17:57:30] did the logs stop/ [17:57:32] ? [17:57:34] PROBLEM - check_log_messages on frav1002 is CRITICAL: CRITICAL: GlobalCollect_endpoint_critical 1 [=1] [17:57:35] i'm seeing 9 failures in the logs [17:58:10] so not the majority of payments [17:58:14] the form loaded this time for me [17:59:39] ok, going to try that api call locally with a non-admin user [17:59:54] oh yeah sorry ejegg forgot about that [18:02:16] dstrine: XenoRyet|afk we might wanna alert evelyn [18:02:34] PROBLEM - check_log_messages on frav1002 is CRITICAL: CRITICAL: GlobalCollect_endpoint_critical 2 [=1] [18:03:46] ejegg: I'm seeing way more than 9 [18:04:38] cstone a JS alert error?? [18:05:04] looking to see if these are the same type of errors i emailed them about a bit ago [18:05:07] AndyRussG: yeah haha [18:05:25] AndyRussG: cstone i think when the js fails we call home via js [18:05:33] I saw some of them failures in the logs [18:05:49] ejegg: 799/3 failures [18:06:14] ~266 [18:06:15] nevermind the thing im thinking of was paypal [18:06:23] on the assumption each attempt repeats 3 times [18:07:12] so the js error might be due to the first part of the call failing [18:07:17] May 6 17:32:09 payments1002 SmashPig: ingenico::101945666:101945666.1 | cURL transaction to https://world.api-ingenico.com/v1/6570/hostedcheckouts failed: (6) Could not resolve host: world.api-ingenico.com. cURL verbose logging: * Expire in 0 ms for 1 (transfer 0x558ba7e044f0) [18:07:32] that's an ajaxy event to load the checkout [18:07:34] PROBLEM - check_log_messages on frav1002 is CRITICAL: CRITICAL: GlobalCollect_endpoint_critical 2 [=1],Paypal_endpoint_critical 1 [=1] [18:07:49] yeah just got it after spinny icon for a while then an alert saying "There was an error processing your request" [18:07:54] but there are other errors happening for folks who actually load the checkout [18:08:26] I can see follow-on calls to specific hosted checkout sessions failing also [18:08:57] i can message evelyn shes on hangouts sometimes [18:09:48] since this isn't stopping should we have them take down the ingenico banners? [18:10:21] lemme see if I can get a ratio of success/fail [18:10:35] no changes to network in the time period. [18:11:21] server admin log is clean also. [18:11:34] we're gettng hardly any donations [18:12:09] a handful of astropay/paypal coming through [18:12:09] the failmails are getting more frequent [18:12:21] ok maybe they've fell over entirely [18:12:34] PROBLEM - check_log_messages on frav1002 is CRITICAL: CRITICAL: GlobalCollect_endpoint_critical 2 [=1] [18:12:44] yeah let's raise a flare [18:12:52] okie ill message in the same channel as we did yesterday [18:13:03] thanks [18:13:30] ooh scary alert when @ all haha [18:13:57] interesting. a recursive dig gets an ip. [18:15:20] cstone moving here :) Taking down Peru, Uruguay [18:15:24] thanks! [18:15:26] thanks [18:15:26] thanks tskaff [18:17:22] oh hey I see some successes [18:17:34] PROBLEM - check_log_messages on frav1002 is CRITICAL: CRITICAL: GlobalCollect_endpoint_critical 2 [=1] [18:17:44] i think they're back mayve [18:17:46] maybe [18:17:50] im still getting spinny icon [18:17:57] ah no still errors [18:18:01] some* [18:18:12] Ok they're down [18:19:52] tskaff: thanks!! :) [18:19:56] our EPC deployment is turning into a bad omen [18:20:25] cursed :( haha [18:22:34] PROBLEM - check_log_messages on frav1002 is CRITICAL: CRITICAL: GlobalCollect_endpoint_critical 1 [=1] [18:23:11] https://ingenico-ogone.statuspage.io/ says they are all good (to themselves) [18:23:18] but that doesn't mean much. [18:24:43] i can't see anything useful in the ingenico prod console [18:26:28] :( [18:26:36] hmm let's reach out in an email [18:27:10] jgleeson: evelyn just responded to me in gchat [18:27:18] nice [18:27:32] she seems to be talking to them somehow but yeah we can email too [18:27:34] RECOVERY - check_log_messages on frav1002 is OK: OK [18:27:52] ok we got 1 successful donation from portugal via ingenico [18:28:26] peru sorry*** [18:28:50] completely unrelated but why does my mac think payments is in maintenance mode still [18:28:59] !! [18:29:02] broswer cache? [18:29:07] spelt right [18:29:17] i tried incognito but not a different browser lemme see [18:29:31] safari still does it haha [18:30:04] ingenico loading for me [18:30:52] Your transaction could not be accepted. [18:30:57] step one loaded then payment failed [18:31:07] slow recovery hopefully?? [18:31:42] I'd say they are still on fire [18:31:52] is Evelyn reaching out? [18:31:55] hah evelyn is in the other channel [18:31:59] ha ok [18:32:01] and yes [18:37:34] PROBLEM - check_log_messages on frav1002 is CRITICAL: CRITICAL: GlobalCollect_endpoint_critical 1 [=1] [18:38:01] jgleeson / cstone I verified locally that https://gerrit.wikimedia.org/r/c/wikimedia/fundraising/crm/+/685875/ allows a user with only 'view all contact' perms to make that API call [18:38:35] +2'ing [18:38:39] thanks! [18:38:40] nice [18:38:52] I was just looking at it thinking how to renable the module [18:39:04] re-enable which module? [18:39:06] did you need to do that ejegg for it to appreciate the change? [18:39:14] jgleeson: no, not needed at all [18:39:25] sorry, the ext [18:39:27] oh [18:39:29] ok [18:39:30] just having that function exist makes the hook get called on each api request [18:39:34] ah nice [18:39:40] it felt like a setup thing [18:39:51] saved me going down that rabbit hole [18:39:56] yeah, these api calls aren't configured in the db [18:40:18] (CR) Jgleeson: [C: +2] Relax API perms for proxy getpreferences [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/685875 (owner: Ejegg) [18:40:18] the db has 'which users have which roles' and 'which roles have which permissions' [18:40:29] but then the code determines 'which api call NEEDS which permissions' [18:40:40] thanks! I'll deploy that once it merges [18:41:01] I've gotta wrap up for today [18:41:13] catch you all tomorrow/tuesday [18:41:22] thanks jgleeson [18:41:49] I'll be around on mobile just in case anything else bad happens because I know other folks have other distractions [18:42:34] PROBLEM - check_log_messages on frav1002 is CRITICAL: CRITICAL: GlobalCollect_endpoint_critical 1 [=1] [18:42:40] understandable*** distractions :) [18:42:47] :) [18:43:13] (PS1) Ejegg: Relax API perms for proxy getpreferences [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/685884 [18:43:15] (PS1) Ejegg: Merge branch 'master' into deployment [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/685885 [18:43:22] oh derp [18:43:36] why did that submit like that? [18:43:49] oh oops [18:44:23] (Abandoned) Ejegg: Relax API perms for proxy getpreferences [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/685884 (owner: Ejegg) [18:45:53] (PS2) Ejegg: Merge branch 'master' into deployment [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/685885 [18:46:03] (CR) Ejegg: [C: +2] Merge branch 'master' into deployment [wikimedia/fundraising/crm] (deployment) - https://gerrit.wikimedia.org/r/685885 (owner: Ejegg) [18:47:34] PROBLEM - check_log_messages on frav1002 is CRITICAL: CRITICAL: GlobalCollect_endpoint_critical 2 [=1] [18:48:54] ACKNOWLEDGEMENT - check_log_messages on frav1002 is CRITICAL: CRITICAL: GlobalCollect_endpoint_critical 2 [=1] Dwisehaupt known ingenico issue. [18:53:20] (Merged) jenkins-bot: Relax API perms for proxy getpreferences [wikimedia/fundraising/crm] - https://gerrit.wikimedia.org/r/685875 (owner: Ejegg) [19:04:03] ejegg: cool lgtm! [19:04:31] !log updated fundraising CiviCRM from 8034e47008 to 2052d79248 [19:04:34] thanks AndyRussG [19:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:44] just deployed it, let's test again [19:05:27] oooh, I got the form! [19:05:34] but submitting crashes :) [19:05:43] ejegg: got it tooo!!!! :) [19:05:48] rdbms error [19:05:48] :D me too [19:06:26] cstone: [19:06:40] cstone ejegg also it never stops loading for me [19:07:30] i added the error i saw to the etherpad https://etherpad.wikimedia.org/p/emailpreferrors [19:07:31] huh weird error also [19:09:38] remember, if we don't want to test on live data we can point it at frdev. the config should match there from a system standpoint [19:10:32] hmmm interesting dwisehaupt... I think we have some test users somewhere? or our own users? [19:10:47] not sure if the queue consumer job is currently enabled [19:15:14] yeah. it's possible there's a one off here and there that would need to be adjusted. [19:37:34] RECOVERY - check_log_messages on frav1002 is OK: OK [19:42:56] Iooks like ingenico is behaving again [19:47:34] PROBLEM - check_log_messages on frav1002 is CRITICAL: CRITICAL: GlobalCollect_endpoint_critical 1 [=1] [19:50:27] ok, so we have to figure out what's trying to write to the db on that submit, huh? [19:50:38] ejegg: yep [19:50:56] too bad there's not a full stack trace in the error log there [19:51:10] I guess debug locally with a breakpoint in Wikimedia\Rdbms\Database::beginIfImplied [19:51:12] also any idea why it might appear to try to load continuously? my guess so far looking at browser tools was the favicon [19:51:19] yep [19:51:33] oh hmm, it finishes loading for me [19:51:42] but we should certainly serve a nice wiki logo [19:51:47] as the favicon [19:52:07] I mean, that would be a nice thing to do, not that I expect it to already do that [19:52:22] right [19:52:34] RECOVERY - check_log_messages on frav1002 is OK: OK [19:52:35] yeah twice on FF it didn't finish loading, and then after that it did [19:54:43] checking for blocked user maybe? [19:55:47] hmm, that's a pure select, why would it begin a txn? [19:57:05] we should def set the email prefs wiki locally to readonly [19:57:09] to catch things like thigs [19:57:51] ejegg: yeah agreed [19:59:08] ejegg I should deal with some non-work stuff for about 1.5 hrs unexpectedly but I can definitely figure this one out after that [19:59:55] ejegg: the other thing that seemed surprising to me is that when the civiproxi connection was erroring out, it didn't look like this log message was getting through: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/DonationInterface/+/refs/heads/master/extras/civiproxy/CiviproxyConnect.php#61 [20:00:22] dunno if you can see anything right away that might be an issue there, otherwise I can also dig in there later :) [20:01:48] ejegg: Jeff_Green may have some info on setting it readonly with sqlite. it may not be straightforward. [20:02:13] oh it IS from MediaWiki\Block\DatabaseBlock::newLoad [20:02:34] PROBLEM - check_log_messages on frav1002 is CRITICAL: CRITICAL: GlobalCollect_endpoint_critical 1 [=1] [20:02:41] I think the wiki itself was set readonly last I messed with it, just in the config [20:02:52] Jeff_Green: yep, that's as it should be [20:03:52] Jeff_Green: dwisehaupt: yes all good... so we get this error: May 6 19:05:14 frdata1002 mediawiki[1932]: mediawiki[1932]: [80249ca423bb80bbee306629] /index.php/Special:EmailPreferences/emailPreferences?contact_id=2241615&contact_hash=4285ea7a8f0ff15a74ca92c216abedc1 Wikimedia\Rdbms\DBQueryError from line 1699 of /srv/donorwiki/includes/libs/rdbms/database/Database.php: Error 8: attempt to [20:03:54] write a readonly database Function: Wikimedia\Rdbms\Database::beginIfImplied (MediaWiki\Block\DatabaseBlock::newLoad) Query: BEGIN IMMEDIATE [20:03:55] re. sqlite, I kind of forget what I discovered [20:04:08] which is great, we just need to figure out why something is trying to write there [20:04:17] ejegg: maybe something about storing the session? [20:04:38] I remember there being something about sqlite wanting to be able to write files even if you weren't writing to it [20:05:10] hmmm this seems more on the level of Mediawiki trying to some kind of write query [20:06:14] looking at the config [20:06:43] AndyRussG: is it not the cache db? [20:06:55] $wgObjectCaches[CACHE_DB] = [ [20:07:01] 'server' => [ [20:07:06] 'type' => 'sqlite', [20:07:06] 'dbname' => 'wikicache', [20:07:34] PROBLEM - check_log_messages on frav1002 is CRITICAL: CRITICAL: GlobalCollect_endpoint_critical 1 [=1] [20:09:32] Jeff_Green: that seems reasonable, but I'm sure we use object caches on payments and it's also read-only, no? [20:09:53] payments has memcache to work with [20:11:40] Jeff_Green: so no memcache on e-p-c? [20:11:55] there isn't at the moment, no [20:12:06] K that'll prolly be an issue [20:12:16] in paymentsland I believe it's there for the shared session cache [20:12:29] Unless you want to make that wikicache db writable [20:12:34] PROBLEM - check_log_messages on frav1002 is CRITICAL: CRITICAL: GlobalCollect_endpoint_critical 1 [=1] [20:12:44] Jeff_Green: we also are using a session cache for e-p-c [20:14:37] I mean shared between servers, so shifts in load balancing doesn't break everything about sessions and throttling [20:15:10] I'm not opposed to spinning up memcached, just wondering what we're solving with it before we add services [20:15:22] right [20:16:01] I'm not super familiar with the mw options for session caching, I assumed it would use php built in stuff [20:16:02] Jeff_Green: well currently it's just a minor thing we keep in session between when the form displays and the user submits [20:16:17] Jeff_Green: right if there are other simpler options that's great [20:16:34] Just so long as it persists between web requests [20:17:14] Jeff_Green: if u like a bit later I can confirm that that's the issue and look into what the simplest option would be [20:17:24] https://www.mediawiki.org/wiki/Manual:$wgSessionCacheType [20:17:29] (gonna be not fully focused for the next little while) [20:17:34] RECOVERY - check_log_messages on frav1002 is OK: OK [20:18:29] AndyRussG: sure [20:18:36] https://www.mediawiki.org/wiki/Manual:$wgMainCacheType [20:18:57] that's set to CACHE_ACCEL [20:19:29] Jeff_Green can we just set the object cache to that too then? might be a quick fix [20:19:39] I'm not sure [20:19:41] though again I haven't confirmed that that's the issue, it does seem likely [20:19:45] it does [20:20:38] I think CACHE_ACCEL should find OPcache [20:22:10] I mean, so long as nothing explodes egregiously if we try hehehe :) [20:22:40] lol [20:23:40] can you test and make sure it does what it needs to? [20:24:36] Jeff_Green: yes I can do that, again, it'd be a bit later... But if it prevents an error when form submits, that'll be a step [20:25:40] Though actually, come to think of it, the session storage thing happens when the form displays, not when it submits, so I suppose that's when I might have expected that to error out [20:26:07] AndyRussG: sorry, was afk for a bit [20:26:33] ejegg: it's ok I'm like 80% afk still :) [20:26:35] so that DatabaseBlock::newLoad mentioned there [20:26:50] is something to do with the 'blocked users/ips' feature of mediawiki [20:27:02] ejegg: ^ see my last comment, if it was due to us writing session, that might have happened on form load, not form submission? [20:27:15] right, I don't think it's session [20:27:52] ejegg: ^ also note Jeff_Green mentioned we don't have memcached, and object cache is set to use DB [20:28:16] right, that's why i AM curious why the resourceloader readonly cache errors went away when we told RL to use the objectcache instead of the main db [20:28:38] hmmmm [20:29:01] ejegg: what if the correspondence between form submission and the rdbms error is just coincidence? [20:29:33] fr-tech ingenico update failmail has stopped and logs look normal, banners have been put back up [20:29:34] it could be rather just that some MW job runs a bit after the form is loaded, and it just happened to show up in the log at the same time one of us submitted the form? [20:29:54] cstone: cool!! :) [20:32:22] AndyRussG: we get one step of the traceback, which points to DatabaseBlock::newLoad [20:32:48] that's a function which is indrectly called from Mediawiki::performRequest [20:32:59] before the special page code is executed [20:33:25] i'm trying to understand why it only errors on submit, because it seems to also call that fn on initial load [20:35:43] ejegg: thx for digging in :) [20:48:12] grr grr, i guess i need to set up a sqlite db locally [20:48:21] since it runs different code [20:49:36] hmm, I guess we want to do the install from scratch on the sqlite db [20:50:06] oh ha, or I could use the one from frpm [20:50:12] * ejegg snag [20:52:31] hmm, that donorwiki_l10n_cache.sqlite being readonly will be a bit of a bear [20:52:50] ops expect to recreate that every time we change translations? [20:54:14] dwisehaupt: I can't read the deployed LocalSettings.php on frdata [20:54:15] you mean that it'd need to write to the db every time we update the code with new i18n messages in the code? [20:54:40] dwisehaupt: does it have $wgReadOnly = 'whatever' uncommented, or commented? [20:54:53] AndyRussG: isn't that what the l10n_cache is for? [20:55:16] idk? [20:57:58] hmm, if we can set CONN_TRX_AUTOCOMMIT on the loadbalancer it should avoid that spurious begin transaction [20:59:24] weird, it looks like it SHOULD already have that by virtue of being SQLite [20:59:59] oh wait, no it CLEARS that on SQLite [21:00:04] uggg [21:00:11] so confused [21:00:21] ok, gotta replicate prod config locally [21:01:28] ejegg: just a thought, not to burden u further on this, but if u see a quick way to make a prod-replicated setup the default on fr-dev that'd be sweet eh [21:03:47] ejegg: btw also a quick way to add stuff to your Docker images is with a build directive in the docker-compose-override file, pointing to a local add-hock Dockerfile [21:04:05] and that'll just add an additional layer onto the existing image with whatever extra stuff you'd like to add in [21:08:25] yah yah, i think the quick way would be to check in a sqlite file to the config dir [21:08:41] so no need for image changes [21:08:46] just in the fundraising-dev repo [21:09:14] Just trying with the prod sqlite file locally now, i can load the page but all the css + js is gone [21:09:19] ejegg: uncommented: $wgReadOnly = 'MediaWiki is in read-only mode'; [21:09:26] ok dwisehaupt, thanks! [21:10:42] the thought on sqlite is that we are hoping to not have to spin up services if they aren't needed. ie: do we need the full weight of a mariadb instance for this project or will the smaller sqlite suffice. [21:10:57] bah who needs that css and js stuff anyway [21:11:03] we can go retro and just be html1 [21:11:36] dwisehaupt: I think mariadb definitely not needed... heheh I hope! [21:16:05] ok, I can get an RDBMS error locally on submit [21:16:12] using sqlite and readonly=true [21:17:15] ejegg: fantasmic! [21:19:13] pasted the full trace at the bottom of the etherpad [21:19:17] will format a bit [21:21:36] :) :) [21:21:49] guhh [21:22:05] can we just set config to ignore the db user blocks? [21:29:16] ups, that seems to be the wrong execption there [21:32:27] OK, apparently BEGIN DEFERRED is ok for SQLite but BEGIN IMMEDIATE is not [21:32:41] and when we just load the form it uses BEGIN DEFERRED [21:32:51] while when we submit the thing it's BEGIN IMMEDIATE [21:33:05] huh crazy [21:33:37] dwisehaupt: any reason I can't check that donorwiki sqlite database file into a public repo? [21:33:50] hmmmm... i am not sure. [21:34:07] i'm not certain what is stored in there to be honest. [21:36:21] ejegg: hey sorry to be just here in the peanut gallery, um I feel like we should instead re-create it locally, and have a known path for doing so? [21:36:50] it can be checked out, but it should be a locally created one? [21:43:16] ok sure AndyRussG [21:44:37] thx! :) [21:50:49] guh, even a locally-generated one is missing css + js [21:56:28] ah, it's that same ResourceLoaderUseObjectCache setting [21:56:54] ok, got it submitting with a sqlite db [21:56:59] but not readonly [21:57:21] so payments is also readonly [21:57:34] but I guess readonly mysql doesn't barf on BEGIN IMMEDIATE [21:57:47] feeling like a mw core issue? [21:58:27] like, when db is sqlite and readonly is set, they shouldn't even try to BEGIN IMMEDIATE [21:58:45] https://sqlite.org/forum/forumpost/1181a6de39?t=h&hist [21:59:22] ok, maybe a sqlite bug ^^^ [22:00:34] uff... [22:01:01] "IMMEDIATE cause the database connection to start a new write immediately, without waiting for a write statement" [22:01:13] so hmm, maybe it should barf when sqlite is readonly? [22:01:18] so, looks like the ingenico issue was their dns provider (neustar/ultradns) having issues. [22:01:28] ah fun [22:03:09] dwisehaupt: ahhh, wait, that's mentioned in the config [22:03:22] 'trxMode' => 'IMMEDIATE' [22:03:31] let's try setting that to something else [22:03:34] lemme see [22:04:14] oh wait, but only for the other three dbs [22:04:30] lemme see if I can find a setting to disable that [22:08:26] oh bother, no, for the main db that's determined here: https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/db/MWLBFactory.php$203 [22:08:41] AndyRussG: ^^^ is why the thing crashes on POST and not on GET [22:10:41] dwisehaupt: we could kill another day or two trying to work around this in code [22:10:53] how much work is adding another readonly local mariadb? [22:27:38] hmmm... getting the db software isn't too bad. i'm not sure if we'd have to do something with the mw instance to get it to populate correctly. [22:28:29] given that we would run it standalone to start, it would probably be fully done midday/early afternoon tomorrow. [22:28:53] long term we would want to replicate it to the host in codfw but that can be done separately. [22:30:20] jeff did the mw setup side of this so it would be a training up for me to do it. [22:34:22] i'll start a branch for the config in case we want to go that route. [22:48:08] K finally almost done this non-work stuff... [23:42:07] dwisehaupt: I believe you could just copy the db data files from paymentswiki [23:44:07] that'd be cool if the data will work. [23:44:35] i'm prepping the puppet repos now. many bits of moving parts. [23:48:27] ok. i need to get dinner rolling for the family. [23:48:43] i've passed the branch on to jeff for review and possible pushing. [23:52:32] thanks dwisehaupt