[00:14:53] 10Traffic, 10netops, 10Operations, 10ops-eqiad: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3564595 (10RobH) a:05Cmjohnson>03RobH I've contacted Dasher about this system failing to take updates, will update task when I have more.
[00:57:43] 10Traffic, 10Operations, 10ops-eqiad: cp1053 possible hardware issues - https://phabricator.wikimedia.org/T165252#3261314 (10Cmjohnson) @bblack The server is out of warranty but we could try to re-do the thermal paste.
[08:22:01] one-packet scheduling PR merged: https://github.com/facebook/gnlpy/pull/23
[08:39:13] is it the library used by pybal?
[09:17:20] elukey: by pybal 2.0, yes :)
[09:20:31] pybal-ng :D
[09:20:31] 10Traffic, 10Operations: Unclear LVS bandwidth graph in "load balancers" dashboard - https://phabricator.wikimedia.org/T174432#3565123 (10ema) p:05Triage>03Normal
[09:40:46] 10Traffic, 10Operations: Unclear LVS bandwidth graph in "load balancers" dashboard - https://phabricator.wikimedia.org/T174432#3565159 (10fgiunchedi) Yes, they are LVS-specific in the sense that the metrics backing the graphs come from `/proc/net/ip_vs*` and thus only cover ipvs-managed services, and indeed for lv...
[09:49:47] 10Traffic, 10Operations: Unclear LVS bandwidth graph in "load balancers" dashboard - https://phabricator.wikimedia.org/T174432#3561810 (10ema) >>! In T174432#3562830, @BBlack wrote: > Are the non-icmp graphs somehow LVS-specific? Yes, the metrics are: node_ipvs_backend_connections_active, node_ipvs_incoming_p...
[09:51:34] all cache nodes upgraded to varnish 4.1.8-1wm1
[10:43:57] 10Traffic, 10Operations, 10Wikidata, 10wikiba.se, 10Wikidata-Sprint-2016-11-08: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#1293753 (10Lydia_Pintscher) After discussion with Faidon at Wikimania we agreed: * hosting can move now * domain is registe...
[14:55:14] on the vcl_config scoping thing.
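(A minimal sketch of where the node_ipvs_* metrics discussed above come from, assuming the usual `/proc/net/ip_vs_stats` layout — two header lines followed by a row of cumulative counters, which the kernel prints in hexadecimal. This is hypothetical illustration code, not the actual exporter:)

```python
# Sketch: reading LVS cumulative totals from /proc/net/ip_vs_stats.
# The kernel prints these counters as hex, not decimal.

def parse_ipvs_stats(text):
    """Return (conns, in_pkts, out_pkts, in_bytes, out_bytes) totals."""
    lines = text.splitlines()
    # First two lines are column headers; the third holds the totals.
    fields = lines[2].split()
    return tuple(int(f, 16) for f in fields)

# Example with made-up counter values:
sample = """\
   Total Incoming Outgoing         Incoming         Outgoing
   Conns  Packets  Packets            Bytes            Bytes
      1A       FF        0             1000                0
"""

print(parse_ipvs_stats(sample))  # (26, 255, 0, 4096, 0)
```

Since the counters only exist for ipvs-managed services, anything not behind LVS simply never shows up in these graphs — which is the point fgiunchedi makes above.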
the fix seems correct in the moment, but it's also wrong in a way that points out existing past wrongness
[14:55:37] in that common::vcl is shared between fe+be instances, and we're now fixating it on the fe's values
[14:56:27] the core existing issue is that vcl_config is specified per-instance (fe vs be), but used from shared VCL files in place
[14:56:57] (well, and in general, vcl_config is kind of ugly)
[14:57:33] I'm not even sure what the right-est answer is
[14:59:32] looking at another angle on the same thing: I think the problem here only really exists in the form of analytics.inc.vcl.erb (which comes from common::vcl) using vcl_config to get access to "top_domain"
[14:59:59] which is then defaulted in the template (via fetch() args) to "org", and only specified for the text cluster, so I guess it's just broken for upload on beta
[15:00:19] (it's used by all clusters' frontends)
[15:01:39] one could make the argument that top_domain doesn't belong in vcl_config at all, which gets around this in a different way
[15:02:08] really it doesn't even belong to role::cache::text, it belongs in some more-abstract namespace for all cache clusters, differentiating/defaulting based on prod-vs-betacluster
[15:07:53] but whatever, if the current change gets past a futureparser issue that's fine for now, we can always revisit this later during some future refactor
[15:08:03] I don't see any extremely simple fix there
[15:09:50] basically common::vcl's templated instance-shared files shouldn't have access to per-instance data like the fe's or be's vcl_config.
and shouldn't need it, because any variables it needs access to properly belong at a higher scope than per-instance (per-cluster or global to all clusters)
[16:42:10] <_joe_> bblack: I agree, mostly, but I'm in back-to-back meetings for another hour and something
[16:42:17] <_joe_> but I have good news for you
[16:42:40] <_joe_> Aaron worked on purge rates, and it had some important consequences: the purge rate dropped by ~70%
[16:43:21] <_joe_> see https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?panelId=6&fullscreen&orgId=1&from=1503182795274&to=1504111110134&var-site=All&var-cache_type=text&var-status_type=1&var-status_type=2&var-status_type=3&var-status_type=4&var-status_type=5
[16:48:45] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3566722 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp4021.ulsfo.wmnet', 'cp4022.ulsfo.wmnet', 'cp4023.ulsfo....
[16:51:07] 10netops, 10Operations, 10ops-codfw: Power alarm flap on asw-d-codfw:et-7/0/52 channel 3 - https://phabricator.wikimedia.org/T174366#3566753 (10ayounsi) 05Open>03Resolved a:03ayounsi Papaul replaced the optic on the switch side, levels back to normal: ``` > show interfaces diagnostics optics et-7/0/52...
[17:26:58] https://edgemesh.com/ is really interesting
[17:28:00] I suspect we wouldn't use it as-is (their commercial network); there are a lot of thorny issues around privacy and functionality they don't explain well...
[17:28:25] but the concept is interesting, and maybe an open-source variant could exist that we manage ourselves
[17:30:02] the idea is you have your clients run a serviceworker in their browser. these serviceworkers form a global mesh network using WebRTC (like video chat, etc) to communicate with each other, but they're actually tunneling arbitrary other stuff within a WebRTC wrapper. for commonly-cacheable content (e.g. images), the mesh network tries to side-load assets from other nearby clients (e.g. on the same last-mile network) instead of from our "real" origins when that's possible and helps latency.
[17:31:14] and then there's of course gobs of tiny details to work out there, about managing the mesh network, not stalling out users because they happened to be fetching from a peer that just closed their laptop, measuring the latency benefits accurately to each client, broadcasting out purges when necessary, etc
[17:32:17] it's sort of like BitTorrent, but for cacheable multimedia content browsers are viewing, built out of serviceworkers + webrtc connections, and with some central management to cover all the strange edge cases.
[17:49:26] 10Traffic, 10Operations, 10ops-ulsfo, 10Patch-For-Review: replace ulsfo aging servers - https://phabricator.wikimedia.org/T164327#3567032 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by bblack on neodymium.eqiad.wmnet for hosts: ``` ['cp4022.ulsfo.wmnet'] ``` The log can be found in `/var/lo...
[18:07:06] <_joe_> yeah I thought about using bittorrent-like DHTs for non-local caching for a long time
[18:07:40] <_joe_> but all distributed networks like those usually have woeful performance and have more of an anti-censorship focus than anything
[18:07:53] <_joe_> this is an interesting different take on the subject
[18:07:57] <_joe_> from a different angle
[18:17:24] 10netops, 10Cloud-VPS, 10Operations: dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596#3567207 (10Krenair)
[18:18:46] 10netops, 10Cloud-VPS, 10Operations: dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596#3567234 (10Krenair)
[18:19:58] 10netops, 10Cloud-VPS, 10Operations: dmz_cidr only includes some wikimedia public IP ranges, leading to some very strange behaviour - https://phabricator.wikimedia.org/T174596#3567207 (10Krenair) See also T167357, where this task will probably become obsolete; I just wanted to document the effect of this really.
[19:50:02] a lot of ™ on that landing page
[20:57:02] 10netops, 10Operations: set up cr3-esams - https://phabricator.wikimedia.org/T174616#3567954 (10ayounsi)
[20:57:39] 10netops, 10Operations: set up cr3-esams - https://phabricator.wikimedia.org/T174616#3567973 (10ayounsi)
[23:37:47] 10netops, 10Operations: set up cr3-esams - https://phabricator.wikimedia.org/T174616#3568606 (10ayounsi)