[09:45:56] Traffic, Operations: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3962498 (ema)
[09:46:42] Traffic, Operations: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3962510 (ema) p:Triage>Normal
[09:47:58] Traffic, Operations, Ops-Access-Requests: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3962498 (ema)
[09:54:54] Traffic, Operations, Ops-Access-Requests: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3962548 (ema)
[10:04:53] Traffic, Maps-Sprint, Operations: Decide on Cache-Control headers for map tiles - https://phabricator.wikimedia.org/T186732#3962587 (fgiunchedi) p:Triage>Normal
[11:13:55] there's a bunch of pybal ipvs UNKNOWNs in icinga btw
[11:14:29] I'd guess related to https://gerrit.wikimedia.org/r/c/409338/
[11:14:39] ema: ^
[11:30:44] godog: indeed
[11:43:49] godog: https://gerrit.wikimedia.org/r/c/409846/
[11:47:46] ema: yup, LGTM
[12:24:30] hashar: hey :)
[12:24:56] so, vgutierrez tried to run pcc against a gerrit change, and he got a 403 back from jenkins
[12:25:22] does one need to be part of some specific ldap group (perhaps ops?) in order to run pcc?
[12:26:11] the error is interesting in itself
[12:26:14] jenkinsapi.custom_exceptions.JenkinsAPIException: Operation failed. url=https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/buildWithParameters, data={'json': '{"redirectTo": ".", "parameter": [{"name": "GERRIT_CHANGE_NUMBER", "value": "409844"}, {"name": "LIST_OF_NODES", "value": "cp1008.wikimedia.org,lvs1001.wikimedia.org"}], "statusCode": "303"}', 'LIST_OF_NODES':
[12:26:20] 'cp1008.wikimedia.org,lvs1001.wikimedia.org', 'GERRIT_CHANGE_NUMBER': '409844'}, headers={'Content-Type': 'application/x-www-form-urlencoded'}, status=403, text=
[12:28:36] well there's problem in general with his LDAP
[12:29:06] bblack: morning!
:)
[12:29:14] we should get the username changed there first before debugging potentially the same issue in tons of different places
[12:29:23] (which is that OIT gave him a non-ascii LDAP username :P)
[12:30:01] morning :)
[12:30:12] I think he now has an ascii-friendly ldap username
[12:30:22] non-ascii LDAP... cannot say good morning bblack :D
[12:31:21] creating problems already on my first day
[12:31:39] yeah I can't see the OIT ticket
[12:31:59] are you saying you think they changed it already?
[12:32:21] hmm not AFAIK
[12:32:43] (because it's also possible that the ascii translation just "kinda works" in some situations due to attempts at compatible support, but until the actual underlying username is fixed, who knows what issues)
[12:32:44] to log in to the change password tool I need to use the non-ascii
[12:32:55] here on gerrit I see a decent ascii username? https://gerrit.wikimedia.org/r/q/owner:vgutierrez%40wikimedia.org
[12:33:18] that's a different account, right?
[12:33:24] technically, no
[12:33:30] but it is a different interface :)
[12:33:32] oh ok
[12:34:50] vgutierrez: you've already done things with phabricator, created a patch on gerrit, amended it, replied to comments, ...
[12:34:58] this was a productive morning!
[12:35:12] and time for me to go for lunch :)
[12:35:33] I'm cooking some pasta right now :P
[12:43:59] ema: so, cp4023 case is interesting, but it's hard to make any definitive conclusions since it's the only one.
[12:44:06] (with significant ramping, anyways)
[12:44:21] and it hit its weekly restart today already I think
[12:44:45] https://grafana.wikimedia.org/dashboard/db/varnish-mailbox-lag?orgId=1&from=1518152070438&to=1518373230649&var-datasource=ulsfo%20prometheus%2Fops&var-cache_type=upload&var-server=All
[12:45:22] the ramps didn't get past ~600K, so not horribly awful. there were associated small amounts of failed fetches though, so it is a "real" issue.
[12:46:02] since I was here when the first ramp took off, I monitored it for a bit, and then I manually reniced the expiry thread to higher priority, but there's no way to know what impact that had (vs it naturally resolving itself regardless)
[12:46:39] the renice would basically be a weaker version of the RT-priority setting, without the lock-prio-inheritance part (I think those aren't usefully settable at runtime or I'd have tried that)
[12:47:42] also interesting is that after each ramp, there was an increase in the baseline lag. all the nodes seem to normally be slowly accruing a small single-digit base lag over time.
[12:48:00] but 4023 jumped up to a ~6K baseline after the first spike, and by the last that was up to ~14K.
[12:48:48] the baselines seem like new behavior, some kind of "leak" from the POV of the varnish stats. I don't know if that means there are mailbox messages actually left stuck in the box, or it's more of an accounting leak. The latter seems more likely.
[14:03:26] bblack: interesting
[14:03:42] bblack: how did you single out the expiry thread id?
[14:04:40] ema: if you list threads on the system instead of procs, varnish sets the thread process titles (in top or ps -efL or similar)
[14:04:55] the expiry thread is called "cache-timeout" in the thread process title
[14:05:08] ooooh cool
[14:05:55] last time we did some digging on this I was using systemtap to get the expiry thread id :)
[14:05:58] it's kind of interesting, when you look at any cache in top and split out threads with H
[14:06:14] the expiry thread always has way more cumulative CPU time than any of the worker threads :)
[14:06:37] it's the dusty corner of their design
[14:10:34] * volans knowing close-to-nothing about varnish internals wonders if it could have multiple expiry threads based on objects size or hashing
[14:27:41] ema hi, i've added Valentín to the trusted user group on phab.
[14:28:31] Probably should include adding users to that group as part of the Onboarding thing, as it grants them permissions to do things that users outside the group can't.
[15:12:14] Traffic, Operations, Ops-Access-Requests, Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3962498 (Vgutierrez) Public GPG key for pwstore access: ``` -----BEGIN PGP PUBLIC KEY BLOCK----- mQINBFqBcZoBEACfsj5/PYP3lHfihfGWVBDkp4GfB8JFTVTeUQS+r8YDh...
[16:26:56] Traffic, Operations: Migrate dns caches to stretch - https://phabricator.wikimedia.org/T187090#3964044 (BBlack) p:Triage>Normal
[16:30:25] Traffic, Operations, Ops-Access-Requests, Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3964083 (RobH)
[16:47:33] Traffic, Operations, Ops-Access-Requests, Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3962498 (RobH) So there is typically a shell request task, and then the ops team approves the access to root and ops groups. We went ahead and just had ou...
[17:44:21] Traffic, Operations, Ops-Access-Requests, Patch-For-Review: Ops Onboarding for Valentín Gutiérrez - https://phabricator.wikimedia.org/T187035#3962498 (Dzahn) added to WMF-NDA-Requests group on Phabricator https://phabricator.wikimedia.org/project/members/974/
[22:58:39] Traffic, Operations, Patch-For-Review: Renew unified certificates 2017 - https://phabricator.wikimedia.org/T178173#3965468 (RobH)
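
Note on the 14:03-14:05 exchange about singling out the expiry thread: the same lookup that `ps -efL` or `top` with `H` gives interactively could be scripted roughly as below. This is only a sketch, not what was run in the log. It assumes Linux, where each thread's name is readable from `/proc/<pid>/task/<tid>/comm`, and it takes the "cache-timeout" name from the log rather than from Varnish documentation. The function names `find_threads_by_name` and `renice_thread` and the `proc_root` parameter are hypothetical, introduced here for illustration.

```python
import os
from pathlib import Path


def find_threads_by_name(pid, name, proc_root="/proc"):
    """Return the sorted thread IDs of process `pid` whose name equals `name`.

    Assumes the Linux procfs layout <proc_root>/<pid>/task/<tid>/comm.
    `proc_root` is a hypothetical parameter so the lookup can be tested
    against a fake directory tree.
    """
    task_dir = Path(proc_root) / str(pid) / "task"
    matches = []
    for entry in task_dir.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:
            # The thread exited between listing and reading; skip it.
            continue
        if comm == name:
            matches.append(int(entry.name))
    return sorted(matches)


def renice_thread(tid, nice_value):
    """Set the nice value of one thread (Linux treats nice as per-thread).

    Lowering the nice value (raising priority) normally requires root.
    """
    os.setpriority(os.PRIO_PROCESS, tid, nice_value)
```

Usage would look something like `for tid in find_threads_by_name(varnishd_pid, "cache-timeout"): renice_thread(tid, -5)`, where `varnishd_pid` and the value `-5` are placeholders; the log does not say which nice value was used.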