[06:39:50] vgutierrez: nice work with ncrecdir :)
[06:40:29] hmmm we're getting "a lot" of traffic from wikipedia.co.il
[06:42:09] 40k requests since the last log rotation in one node, so 40 requests per second
[06:42:12] not bad
[06:43:05] or 80 requests per second if we consider both equiad servers
[06:43:09] *eqiad
[06:44:19] and all of them from the same IP
[08:54:08] Traffic, Operations: Provide prometheus metrics for the ncredir service - https://phabricator.wikimedia.org/T228382 (Vgutierrez)
[08:59:38] I've added traffic folks to https://gerrit.wikimedia.org/r/c/operations/puppet/+/523891 which also gets rid of 5xx alerts based on absolute numbers, please take a look!
[09:34:32] netops, Operations, Operations-Software-Development, Goal: Configuration management for network operations - https://phabricator.wikimedia.org/T228388 (Volans) p:Triage→Normal
[09:46:08] netops, Operations, Operations-Software-Development, Goal: Configuration management for network operations - https://phabricator.wikimedia.org/T228388 (Volans) > [stretch] Evaluate Netbox to store network secrets After playing a bit with secrets in our Netbox test box I've come to the conclusio...
[09:58:48] ema: hi, in case you are around. Can we get the varnish/ha/nginx stack to just be a pass through to a backend http server? Ie without any caching :]
[09:59:34] I might have the use case of hosting some git repositories over http and might want to have them behind varnish (I will fill a task for that anyway at some point)
[10:12:25] hashar: sure! If possible, the origin server (in this case the git hosting thing) should indicate that responses must not be cached (see https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control)
[10:13:59] hashar: if that can't be done, we're a bit sad but still can force no-cache behavior in varnish/ATS for the specific service
[10:15:19] ema: cool. I am still not sure how I will expose the git repositories though.
But most probably via an Apache frontend which potentially could inject the no cache headers of magic
[10:16:00] so seems the shortanswer is that varnish/ats obeys to whatever the origin server indicate and varnish/ats does not magically enforce caching
[10:16:04] \o/
[10:16:31] then I cant tell the impact of passing through potentially huge files (eg git pack files :\)
[10:30:03] Traffic, Analytics, Analytics-Kanban, Operations, and 2 others: TLS certificates for Analytics origin servers - https://phabricator.wikimedia.org/T227860 (elukey)
[10:46:19] netops, Operations, User-fgiunchedi: Use centrallog1001 for network devices syslog - https://phabricator.wikimedia.org/T228275 (fgiunchedi) Open→Invalid Indeed the solution is to change the syslog cname.
[11:04:35] Traffic, Operations, Patch-For-Review: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 (fgiunchedi) Drive by comment that occurred to me yesterday while looking into {T227668}, during the transition period we'll have to adjust dashboards to account fo...
[11:45:10] Hey bblack and ema, I don't know if you know about this RFC: https://phabricator.wikimedia.org/T120085 If implemented, does it cause issues for traffic or VCL?
[12:17:43] Traffic, Operations, Performance-Team, TechCom-RFC, and 4 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (BBlack) I like the end result here, and I don't think it's problematic from the #Traffic perspective in the long view, but I think the...
[12:19:59] Amir1: thanks for pointing it out!
[12:24:24] Traffic, Operations, Performance-Team, TechCom-RFC, and 4 others: Serve Main Page of WMF wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (BBlack) Oh one more thing that should've been (3) on that list: I'm pretty sure UAs cache 301s "Permanently" as indicated, so there's...
[12:34:09] hi all just a heads up that im about to upgrade mtail on all the cp* servers. this shouldn't cause an issue but wanted to give yuo a heads up. the new version is currently running on cp4031
[12:34:24] ill aim to deployt at ~14:00 to give people a chance to say no
[12:34:42] s/14:00/13:00 UTC/
[12:35:08] * jbond42 goddam BST misaligning me with UTC
[12:37:05] thanks!
[12:43:04] np
[12:44:57] netops, Operations, User-fgiunchedi: Use centrallog1001 for network devices syslog - https://phabricator.wikimedia.org/T228275 (fgiunchedi) Invalid→Open Reopening as I'm seeing syslog traffic from `scs-c1-eqiad.mgmt.eqiad.wmnet` towards lithium even after dns flip
[14:14:15] FYI i have manully upgraded cp2026 which seems to be working. I plan to leave this in place (unless there are objections) to see if it remains stable and to help troubleshoot. Also of note is that cp1085 always dies
[14:16:50] there's shouldn't be any notable diffs in our software config between machines. is it possible some mismatched dependencies are deployed?
[14:17:06] (re: cp1085 acting different)
[14:17:26] bblack: this is a go package which has all its libraries staticly linked
[14:17:37] still, could be underlying shlibs
[14:18:10] we probably haven't been great about routinely doing "apt upgrade" for all the non-critical fixups over the past several months
[14:18:25] which may have lead to some package version disparities depending on the age of the last reimage of the machine.
[14:18:28] not sure moritzm godog are probably better placed to anser if thats could be the issue
[14:20:16] the cp* hosts are fully up-to-date wrt stretch, the only thing pending there is mtail
[14:21:03] was about to link https://debmonitor.wikimedia.org/hosts/cp1085.eqiad.wmnet :)
[14:21:40] (select show only upgradable packages)
[14:22:05] nice
[14:23:20] what's interesting, is they all only have one pending upgrade, but there are some distinct classes of total-count-of-packages
[14:23:42] range of 491 - 602 :)
[14:23:46] in fact it look slik it just takes longer to die on some servers then others. so it could be something its struggleing to parse which happens more frequently on different servers
[14:23:53] some of that could be text-vs-upload though
[14:24:11] maybe need some apt autoremoves too
[14:25:35] fyi i have also rolled back cp2026 now as well
[14:28:33] yeah going to clean up autoremove just because I'm looking. I doubt it's related
[14:28:45] ack cheers
[14:30:53] yeah still a fair amount of pkgcount discrepancy. some of it's ok (text vs upload), a lot of it's probably when we've installed one-off debugging tools on a host. some of it may be age-of-last-reimage.
[14:31:20] it'd be nice to reach a point where we could just automate routine reimaging for these.
[14:31:45] (e.g. nothing is older than ~30d from its last reimage, because some tool goes around auto-reimaging them with depools on such a schedule)
[14:32:24] main blocker is probably our awful cp puppetization and its horrid dependency loops and breakage on initial post-imaging puppetization.
[14:32:46] hopefully ATS conversion will continue cleaning up some of the worst of that :)
[14:34:16] Traffic, Operations: Separate Traffic layer caches for PHP7/HHVM - https://phabricator.wikimedia.org/T206339 (akosiaris) Open→Resolved a:akosiaris I think we can resolve this, right?
I am gonna be bold and resolve it, feel free to reopen if needed
[14:38:25] the auto rebuild thing would be nice but as you say puppet policy
[14:39:52] you could think of it as a polite chaosmonkey :)
[14:40:16] :)
[14:51:46] lol
[15:12:41] Traffic, Operations, Phabricator: Make phame cacheable - https://phabricator.wikimedia.org/T219978 (epriestley) > I'm not sure what the purpose of the cookie is... This cookie mostly supports CSRF protection for login attempts (`phsid` is "**PH**abricator **S**ession **ID**"), and prevents an attack...
[15:13:23] Traffic, Commons, MediaWiki-File-management, Multimedia, and 2 others: server-cache did neither update on uploading nor with ?action=purge - https://phabricator.wikimedia.org/T228433 (JoKalliauer)
[16:07:33] netops, Operations, User-fgiunchedi: Use centrallog1001 for network devices syslog - https://phabricator.wikimedia.org/T228275 (fgiunchedi) Starting today at `12:48` until now, lithium has seen a variety of traffic towards ports 514 and 10514 from these devices, not sure if dns for syslog is never lo...
[22:00:52] netops, Operations, ops-codfw: Cable mr1-codfw<->cr1/2-codfw through asw-a-codfw - https://phabricator.wikimedia.org/T228112 (Papaul) @ayounsi please pick a day and time sometimes next week (Mon - Wed) and let me know. Thanks
[22:03:53] netops, Operations, ops-codfw: update RE-S-X6-64G-S in cr[12]-codfw - https://phabricator.wikimedia.org/T226422 (Papaul) @ayounsi please pick a day and time sometimes next week (Mon - Wed) and let me know. Thanks
[22:25:04] netops, Operations, User-fgiunchedi: Use centrallog1001 for network devices syslog - https://phabricator.wikimedia.org/T228275 (ayounsi) Only solution I found so far on Juniper is to deactivate/activate that syslog target (tested with cr2-esams).