[00:14:53] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3564595 (10RobH) a:05Cmjohnson>03RobH I've contacted Dasher about this system failing to take updates, will update task when I have more.
[00:57:43] 10Traffic, 10Operations, 10ops-eqiad: cp1053 possible hardware issues - https://phabricator.wikimedia.org/T165252#3261314 (10Cmjohnson) @bblack The server is out of warranty but we could try to re-do the thermal paste.
[08:22:01] one-packet scheduling PR merged: https://github.com/facebook/gnlpy/pull/23
[08:39:13] is it the library used by pybal?
[09:17:20] elukey: by pybal 2.0, yes :)
[09:20:31] pybal-ng :D
[09:20:31] 10Traffic, 10Operations: Unclear LVS bandwidth graph in "load balancers" dashboard - https://phabricator.wikimedia.org/T174432#3565123 (10ema) p:05Triage>03Normal
[09:40:46] 10Traffic, 10Operations: Unclear LVS bandwidth graph in "load balancers" dashboard - https://phabricator.wikimedia.org/T174432#3565159 (10fgiunchedi) Yes, they are LVS-specific in the sense that the metrics backing the graphs come from `/proc/net/ip_vs*` and thus only cover ipvs-managed services, and indeed for lv...
[09:49:47] 10Traffic, 10Operations: Unclear LVS bandwidth graph in "load balancers" dashboard - https://phabricator.wikimedia.org/T174432#3561810 (10ema) >>! In T174432#3562830, @BBlack wrote: > Are the non-icmp graphs somehow LVS-specific? Yes, the metrics are: node_ipvs_backend_connections_active, node_ipvs_incoming_p...
[09:51:34] all cache nodes upgraded to varnish 4.1.8-1wm1
[10:43:57] 10Traffic, 10Operations, 10Wikidata, 10wikiba.se, 10Wikidata-Sprint-2016-11-08: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#1293753 (10Lydia_Pintscher) After discussion with Faidon at Wikimania we agreed: * hosting can move now * domain is registe...
[14:55:14] on the vcl_config scoping thing.
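(A minimal sketch of where the node_ipvs_* metrics discussed above come from, assuming the usual `/proc/net/ip_vs_stats` layout — two header lines followed by a row of cumulative counters, which the kernel prints in hexadecimal. This is hypothetical illustration code, not the actual exporter:)

```python
# Sketch: reading LVS cumulative totals from /proc/net/ip_vs_stats.
# The kernel prints these counters as hex, not decimal.

def parse_ipvs_stats(text):
    """Return (conns, in_pkts, out_pkts, in_bytes, out_bytes) totals."""
    lines = text.splitlines()
    # First two lines are column headers; the third holds the totals.
    fields = lines[2].split()
    return tuple(int(f, 16) for f in fields)

# Example with made-up counter values:
sample = """\
   Total Incoming Outgoing         Incoming         Outgoing
   Conns  Packets  Packets            Bytes            Bytes
      1A       FF        0             1000                0
"""

print(parse_ipvs_stats(sample))  # (26, 255, 0, 4096, 0)
```

Since the counters only exist for ipvs-managed services, anything not behind LVS simply never shows up in these graphs — which is the point fgiunchedi makes above.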
the fix seems correct in the moment, but it's also wrong in a way that points out existing past wrongness
[14:55:37] in that common::vcl is shared between fe+be instances, and we're now fixating it on the fe's values
[14:56:27] the core existing issue is that vcl_config is specified per-instance (fe vs be), but used from shared VCL files in place
[14:56:57] (well, and in general, vcl_config is kind of ugly)
[14:57:33] I'm not even sure what the right-est answer is
[14:59:32] looking at another angle on the same thing: I think the problem here only really exists in the form of analytics.inc.vcl.erb (which comes from common::vcl) using vcl_config to get access to "top_domain"
[14:59:59] which is then defaulted in the template (via fetch() args) to "org", and only specified for the text cluster, so I guess it's just broken for upload on beta
[15:00:19] (it's used by all clusters' frontends)
[15:01:39] one could make the argument that top_domain doesn't belong in vcl_config at all, which gets around this in a different way
[15:02:08] really it doesn't even belong to role::cache::text, it belongs in some more-abstract namespace for all cache clusters, differentiating/defaulting based on prod-vs-betacluster
[15:07:53] but whatever, if the current change gets past a futureparser issue that's fine for now, we can always revisit this later during some future refactor
[15:08:03] I don't see any extremely simple fix there
[15:09:50] basically common::vcl's templated instance-shared files shouldn't have access to per-instance data like the fe's or be's vcl_config.
and shouldn't need it, because any variables it needs access to properly belong at a higher scope than per-instance (per-cluster or global to all clusters)
[16:42:10] <_joe_> bblack: I agree, mostly, but I'm in back-to-back meetings for another hour and something
[16:42:17] <_joe_> but I have good news for you
[16:42:40] <_joe_> Aaron worked on purge rates, and it had some important consequences: the purge rate dropped by ~70%
[16:43:21] <_joe_> see https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=6&fullscreen&orgId=1&from=1503182795274&to=1504111110134&var-site=All&var-cache_type=text&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5
[16:48:45] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3566722 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp4021.ulsfo.wmnet', 'cp4022.ulsfo.wmnet', 'cp4023.ulsfo....
[16:51:07] 10netops, 10Operations, 10ops-codfw: Power alarm flap on asw-d-codfw:et-7/0/52 channel 3 - https://phabricator.wikimedia.org/T174366#3566753 (10ayounsi) 05Open>03Resolved a:03ayounsi Papaul replaced the optic on the switch side, levels back to normal: ``` > show interfaces diagnostics optics et-7/0/52...
[17:26:58] https://edgemesh.com/ is really interesting
[17:28:00] I suspect we wouldn't use it as-is (their commercial network); there are a lot of thorny issues around privacy and functionality they don't explain well...
[17:28:25] but the concept is interesting, and maybe an open-source variant could exist that we manage ourselves
[17:30:02] the idea is you have your clients run a serviceworker in their browser. these serviceworkers form a global mesh network using WebRTC (like video chat, etc) to communicate with each other, but they're actually tunneling arbitrary other stuff within a WebRTC wrapper. for commonly-cacheable content (e.g. images), the mesh network tries to side-load assets from other nearby clients (e.g. on the same last-mile network) instead of from our "real" origins when that's possible and helps latency.
[17:31:14] and then there's of course gobs of tiny details to work out there, about managing the mesh network, not stalling out users because they happened to be fetching from a peer that just closed their laptop, measuring the latency benefits accurately to each client, broadcasting out purges when necessary, etc
[17:32:17] it's sort of like BitTorrent, but for cacheable multimedia content browsers are viewing, built out of serviceworkers + webrtc connections, and with some central management to cover all the strange edge cases.
[17:49:26] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3567032 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp4022.ulsfo.wmnet'] ``` The log can be found in `/var/lo...
[18:07:06] <_joe_> yeah I thought about using bittorrent-like DHTs for non-local caching for a long time
[18:07:40] <_joe_> but all distributed networks like those usually have woeful performance and have more of an anti-censorship focus than anything
[18:07:53] <_joe_> this is an interesting different take on the subject
[18:07:57] <_joe_> from a different angle
[18:17:24] 10netops, 10Cloud-VPS, 10Operations: dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596#3567207 (10Krenair)
[18:18:46] 10netops, 10Cloud-VPS, 10Operations: dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596#3567234 (10Krenair)
[18:19:58] 10netops, 10Cloud-VPS, 10Operations: dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596#3567207 (10Krenair) See also T167357, where this task will probably become obsolete; I just wanted to document the effect of this really.
[19:50:02] a lot of ™ on that landing page
[20:57:02] 10netops, 10Operations: set up cr3-esams - https://phabricator.wikimedia.org/T174616#3567954 (10ayounsi)
[20:57:39] 10netops, 10Operations: set up cr3-esams - https://phabricator.wikimedia.org/T174616#3567973 (10ayounsi)
[23:37:47] 10netops, 10Operations: set up cr3-esams - https://phabricator.wikimedia.org/T174616#3568606 (10ayounsi)