[08:34:35] 10Domains, 10Traffic, 06Operations, 06WMF-Legal, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3124283 (10Beetlebeard) Thanks a lot. G Suite is verified. All should work, as it is supposed to. Cheers
[14:09:43] bblack: FYI failoid deployed in its final state raising connection refused on 80,443, if we need to add ports to the list it's in hiera
[14:11:10] volans: yeah technically we might have to. also, technically we could code-derive the port list...
[14:11:46] from the lvs config?
[14:12:30] basically if you iterate discovery::services, use the "lvs" key within each service as a reference into lvs::configuration::lvs_services, they each have a "port" attribute
[14:13:00] e.g. restbase is actually on port 7231
[14:13:06] yeah I was thinking the same approach
[14:13:16] (don't ask me why. I think it would make more sense if all internal services used the standard http/s ports :)
[14:13:28] the original requirement was http(s) standard ports :D
[14:27:17] <_joe_> well there is a pretty obvious problem: we would need a local tls terminator on the scb cluster with proper virtualhosts
[14:27:38] <_joe_> and tying every service together would be horrible
[14:28:20] <_joe_> we do LVS-DR so we can't do port mapping on the load balancer
[14:28:35] <_joe_> this might change once we have kubernetes of course
[14:34:07] huh?
[14:34:17] just give them separate virtual IPs
[14:34:40] just because 10 services live on scb1001 or whatever shouldn't mean they have to share an IP
[14:42:58] bblack: it looks like ATS has no obj.hits equivalent unfortunately
[14:43:54] there is however a plugin to deal with N-hit-wonders https://docs.trafficserver.apache.org/en/latest/admin-guide/plugins/cache_promote.en.html
[14:44:40] how sad are we about lack of obj.hits?
[14:46:56] well it's mostly of interest for stats/analysis/tuning/reasoning
[14:47:05] it's not like we rely on it functionally
[14:47:48] (FWIW, I think if I were designing a server I wouldn't track per-object lifetime hits either, that sounds like something that's hard to do in any simple way that doesn't cause pointless writes on what should be read ops)
[14:47:49] we do look for hit/4 or higher on text/upload frontends
[14:48:23] yeah that hit/4 logic is our halfassed N-hit-wonder, but I think the better approach we were looking forward towards didn't use it anyways?
[14:49:19] oh wait, even the better logic still uses hitcount
[14:49:31] I think
[14:50:15] no, I'm wrong (well wrong the second time, right the first time?)
[14:50:26] :)
[14:50:27] the better logic proposed by dsb was based purely on size and a constant
[14:50:42] chance of entering cache = exp(-size/c)
[14:50:52] where size is object size, and "c" is some carefully-tuned constant
[14:51:36] the n-hit part of the equation becomes probabilistically-implicit
[14:52:39] OK, so we go ahead and implement the policy based on object size instead
[14:53:00] right
[14:53:44] the idea of "cache_storage_chance = exp(-size/c)" is that a very large object might have a 5% chance of cache entry each time it's seen, and a very small object might have an 80% chance, or whatever
[14:54:07] and you're just relying on how those statistics work out to store huge objects only if they're hit a lot, with a lower hits threshold for smaller objects.
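A minimal Python sketch of the exp(-size/c) admission idea described above; the constant here is an assumed value rather than a tuned production setting, and Python merely stands in for whatever VCL or Lua would actually implement it:

```python
import math
import random

# Assumed tuning constant: objects near this size get roughly a 37% chance
# of admission per request; "c" would be tuned against the real size mix.
C = 256 * 1024  # bytes

def admit(obj_size_bytes, c=C):
    """Probabilistically decide whether to cache this object on this request.

    Small objects are admitted almost every time they are seen; very large
    objects need many requests before one "wins" admission, so N-hit-wonder
    filtering falls out implicitly, with no per-object hit counters to keep.
    """
    return random.random() < math.exp(-obj_size_bytes / c)

# Example per-request probabilities for a 10 KB and a 1 MB object:
print(math.exp(-10 * 1024 / C))    # ~0.96
print(math.exp(-1024 * 1024 / C))  # ~0.02
```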
[14:55:13] and by doing it probabilistically like that, you avoid having to maintain some kind of internal per-object state tracking (hit counters, which get reset occasionally anyways)
[14:57:16] while at the same time winning a bunch of elegance points!
[14:57:18] (there's a downside to any of these things though: they mean more data transfer to refill cold caches. we might ideally want "c" to auto-tune itself a bit during early uptime. e.g. chances get much higher than normal at uptime=0, and drop back off to the normal constant over the first 10 minutes or something.)
[14:57:43] that sounds much easier to do in lua than VCL, though! :)
[14:58:53] yeah there are multiple options to tune 'c', we could perhaps use the hitrate as well
[15:04:50] bblack: in lvs::configuration::lvs_services: when there is no port it defaults to 80 right?
[15:15:03] volans: seems so
[15:15:42] now this makes me question our data model. do any internal services have duplicate lvs stanzas, like the public "text" vs "text-https", for dual-proto? :)
[15:15:53] seems so :D
[15:16:03] ah yes, apaches vs apaches-https
[15:16:34] well, we can ignore the problem for now I guess and hack a list of ports? I donno
[15:16:39] the dns part doesn't care about ports
[15:17:06] probably the more-ideal way to fix that data model is to get rid of the duplication of e.g. "text" vs "text-https"
[15:17:15] point being that if I get the list dynamically I'm not getting 443 because apaches-https is not mentioned
[15:17:30] I'm getting #
[15:17:36] and instead have "text" have data within it that defines a set of possibly more than one port, each having metadata about protocol and healthchecks separately...
[15:17:45] * volans still checking the correctness
[15:18:02] sounds like a big refactor
[15:18:06] yes :)
[15:18:19] so I guess that's not gonna happen right now :D
[15:18:24] but for now, you can reach a hackier answer by just merging the set you find with {80, 443}
[15:18:52] and hope we don't have cases like restbase=>7231 + restbase-https=>7232
[15:18:52] or check both "key" and "key-https"
[15:19:04] yeah, we could also implicitly assume that "-https" scheme
[15:19:27] search-https port 9243
[15:19:30] there we go :D
[15:19:39] or, we can keep failoid hosts having no explicit rules about ports
[15:19:59] it only has the base role, which presumably doesn't have conflicting services (as those would break the real service nodes that also use the base role)
[15:20:04] and let port rejections happen "naturally"
[15:20:36] yes but right now it has ::base::firewall so I need to specify them
[15:20:45] given that the idea is to have the firewall more or less everywhere
[15:20:50] does base::firewall do a drop instead of reject or something?
[15:20:57] yep default policy DROP
[15:21:01] I specify them as reject
[15:21:06] maybe just change the default?
[15:21:10] (for failoid)
[15:21:17] let me check
[15:21:28] (or if not the actual default, insert a custom ferm rule at the end that rejects all tcp before the default drop)
[15:22:37] (or if all-ports rules aren't possible at the end, maybe one that explicitly specifies a port range/set like "80,443,1024-65535")
[15:23:43] yeah, let me see how the order is done
[15:25:12] or quicker... moritzm any suggestion? (last 11 lines for context) :-P
[15:27:16] hopefully we'll not use discovery for git-ssh (port 22) ;)
[15:28:07] you can't easily change the default policy with our current ferm puppetry, it's hardcoded to DROP in 00_main
[15:28:21] moritzm: but can I add a rule at the very end?
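A rough Python sketch of the port-derivation approach discussed above: iterate discovery::services, follow each service's "lvs" key into lvs::configuration::lvs_services (port defaulting to 80), pick up any "-https" sibling stanza, and merge the result with {80, 443} as the hacky baseline. The data structures below are simplified, made-up stand-ins for the real hiera layout:

```python
# Hypothetical, simplified stand-ins for the hiera data.
discovery_services = [
    {"name": "restbase", "lvs": "restbase"},
    {"name": "appservers", "lvs": "apaches"},
    {"name": "search", "lvs": "search"},
]

lvs_services = {
    "restbase": {"port": 7231},
    "apaches": {},                    # no "port" key -> defaults to 80
    "apaches-https": {"port": 443},
    "search": {"port": 9200},
    "search-https": {"port": 9243},
}

def failoid_ports(discovery, lvs, default=80):
    """Collect every port failoid should reject for the discovery services."""
    ports = {80, 443}  # hacky baseline, per the discussion above
    for svc in discovery:
        key = svc["lvs"]
        ports.add(lvs.get(key, {}).get("port", default))
        https_key = key + "-https"    # implicitly assume the "-https" scheme
        if https_key in lvs:
            ports.add(lvs[https_key].get("port", default))
    return sorted(ports)

print(failoid_ports(discovery_services, lvs_services))
# -> [80, 443, 7231, 9200, 9243]
```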
[15:28:49] to drop all tcp or tcp 80,443,1024-65535?
[15:29:08] I see that the rule I've added is in the middle
[15:29:39] yeah, you could create one by setting "prio" to "99"
[15:29:55] ok, let me try that then
[15:29:55] prio is usually unset for most ferm services and remains at 10
[15:31:51] moritzm: you ok with an all TCP REJECT? or you want them specified?
[15:32:02] (them = the ports)
[15:33:06] if it's with prio 99, so that other ports like SSH are passed before, then covering all ports is fine
[15:33:16] like: rule => "proto tcp REJECT;"
[15:33:34] ok thanks
[15:34:54] 10Traffic, 10DNS, 06Operations: AuthDNS CM/CI refactor - https://phabricator.wikimedia.org/T161148#3125387 (10BBlack)
[15:36:04] just wait till one of the services we want in discovery/failover is git-ssh (phabricator's git endpoint) on port 22, that will be fun :)
[15:36:21] bblack: see above :D
[15:36:34] volans| hopefully we'll not use discovery for git-ssh (port 22) ;)
[15:36:42] :)
[15:37:14] actually... it will work
[15:37:21] probably the most-general answer would be to put a separate virtual IP on the failoid hosts for the failoid service itself, and block all TCP with reject on *that* IP, before all other ferm rules instead of after
[15:37:25] because with base firewall we allow 22 only from some specific hosts
[15:37:35] oh, good point, but still :P
[15:38:22] yeah
[15:49:52] bblack: done, all tcp is rejected :)
[15:50:52] \o/
[15:51:01] 0 0 REJECT tcp -- * * 0.0.0.0/0 0.0.0.0/0 reject-with icmp-port-unreachable
[15:51:04] did _joe_ merge the patches with the other new services defined yet?
[15:51:24] <_joe_> bblack: nope
[15:51:44] ok
[15:51:56] I have a meeting soon, but afterwards I'll manually review/validate them
[15:52:22] <_joe_> bblack: I am working on making redis replica depend on etcd directly
[15:52:25] I mean, they're simple enough, but there's no automated CI check that confirms the two commits match, really, so it's worth it.
[16:06:19] <_joe_> yeah :)
[17:08:29] _joe_: in https://gerrit.wikimedia.org/r/#/c/344088 are the conftool services and discovery::services supposed to end up the same? it seems like they don't. swift-r[ow] is added to discovery but there's no swift in conftool (ditto imagescaler). apertium and aqs are in conftool but not discovery, etc...
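Since there is no automated CI check that the conftool and discovery::services definitions stay in sync, here is a sketch of what such a check could look like: load both files and report service names present on only one side. The file paths and flat YAML layout are assumptions for illustration, not the actual repository structure:

```python
import sys
import yaml

def service_names(path):
    """Load a YAML file and return the set of top-level service names."""
    with open(path) as f:
        data = yaml.safe_load(f) or {}
    return set(data)

def main(discovery_path, conftool_path):
    discovery = service_names(discovery_path)
    conftool = service_names(conftool_path)
    only_discovery = sorted(discovery - conftool)
    only_conftool = sorted(conftool - discovery)
    if only_discovery:
        print("in discovery but not in conftool:", only_discovery)
    if only_conftool:
        print("in conftool but not in discovery:", only_conftool)
    return 1 if (only_discovery or only_conftool) else 0

if __name__ == "__main__":
    # Hypothetical file names; the real data is split across several files.
    sys.exit(main("discovery-services.yaml", "conftool-services.yaml"))
```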
[17:09:02] <_joe_> bblack: uhm, those are in another conftool file
[17:09:07] <_joe_> mediawiki.yaml in the same dir
[17:09:20] <_joe_> so everything in discovery MUST exist in conftool
[17:09:29] <_joe_> but not necessarily vice-versa for now
[17:09:32] well there are mismatches in both directions
[17:09:45] <_joe_> yeah told you I have to re-check that
[17:09:46] oh nevermind
[17:09:54] I see what you're saying now
[17:10:01] <_joe_> I'm pretty near to the end of my epic redis refactoring
[17:10:27] so I'll just worry about discovery::services <=> dns-commit
[17:14:52] added review comment there
[17:14:56] (the dns commit)
[17:16:34] <_joe_> thanks, will look at it tomorrow I guess
[18:05:47] bblack: I'm looking at merging https://gerrit.wikimedia.org/r/#/c/344197 (T159574)
[18:05:48] T159574: LDF endpoint ordering is not stable between servers when paging - https://phabricator.wikimedia.org/T159574
[18:06:16] bblack: I'm not super happy with the single backend server, but since you seem to +1 it, that should be good enough for me :)
[18:08:11] just a check (I'm not entirely comfortable touching varnish), this should work with just a merge, and applying puppet on one varnish server in eqiad to check it is working correctly, and then letting puppet run its way for the rest, right?
[18:19:24] gehel: well it's not a great solution in general, but I think the subpath approach is preferable to the other immediate alternative, which is to go single-server / manual failover for the whole service instead of just one subpath.
[18:19:55] Definitely not as bad!
[18:21:06] bblack: so the deployment of that config change should be straightforward, right?
[18:21:13] but the right answer in the long term is to fix it in the application (or at least, the application's own stack, which would be the "nginx hashes on X-Client-IP" sort of approach)
[18:21:21] gehel: yes, it should be :)
[18:22:18] usually the worst-case fallout with these things is the resulting templated VCL output is bad, which causes VCL to fail to reload, which causes puppet agent to fail, etc
[18:22:32] but the previous last-known-good config keeps chugging for runtime stuff
[18:22:57] the commit isn't doing anything new we haven't done in other req_handling directives, so the templating should handle it fine
[18:24:31] ok, so I'm pushing it now...
[18:30:11] bblack: "[WARN/control] unable to connect to socket /var/run/lldpd.socket: No such file or directory" when running puppet on cp1051 (run is green otherwise)
[18:32:03] I have no idea what's going on there, unrelated to varnish directly, though
[18:32:17] is it happening everywhere?
[18:32:21] yeah, lldpd is not running
[18:32:45] what emitted the warn/control line?
[18:33:03] (because i don't see it in puppet's syslog lines)
[18:33:15] only on cp1051 (I checked only 1051 and 1045)
[18:33:21] bblack: in puppet.log
[18:33:58] https://phabricator.wikimedia.org/P5119
[18:34:12] it's there since the start of the file, so a bunch of puppet runs to say the least
[18:34:14] probably facts trying to get the list of interfaces
[18:34:27] hmmm
[18:34:40] I don't understand enough about lldpd to touch it :(
[18:34:41] it's not apt-get update
[18:34:54] yeah facts, ok
[18:35:58] there's no icinga check for lldpd
[18:36:03] so I guess we don't monitor it
[18:36:14] cp1061 also has lldpd not runnint
[18:36:18] running
[18:37:21] it's been dead a long time, no evidence left in logs
[18:37:29] Loaded: loaded (/lib/systemd/system/lldpd.service; disabled)
[18:38:54] bblack: disabled on cp2009.codfw.wmnet,cp[1051,1061].eqiad.wmnet
[18:39:02] yeah I'm checking all the caches now
[18:39:07] enabled on the other 98 cp*
[18:39:10] ok
[18:39:26] I'm not of much help there... sorry
[18:40:53] checking the fleet in general now
[18:42:20] {'cp1051.eqiad.wmnet': 'LLDPD NOT RUNNING'}
[18:42:20] {'cp2009.codfw.wmnet': 'LLDPD NOT RUNNING'}
[18:42:20] {'cp1061.eqiad.wmnet': 'LLDPD NOT RUNNING'}
[18:42:22] bblack: cp2009.codfw.wmnet,cp[1051,1061].eqiad.wmnet,elastic2020.codfw.wmnet
[18:42:29] of the jessies
[18:42:34] ^ those are the only ones on the whole fleet as far as salt knows
[18:42:39] sudo cumin 'F:operatingsystem = Debian and F:operatingsystemmajrelease = 8' 'systemctl is-enabled lldpd'
[18:42:54] you missed the elastic2020 :D
[18:42:55] I still haven't made my mental switch to cumin! :)
[18:43:19] bblack: don't worry, I've also not pushed for it yet ;)
[18:43:19] elastic2020 isn't even in my list of "all hosts" for salt heh
[18:43:38] probably a salt-minion issue or random fail/timeout
[18:44:25] FWIW, my generic check was: pgrep lldpd >/dev/null || echo LLDPD NOT RUNNING
[18:45:05] I was first checking if it was enabled in systemd, then checking if it's running
[18:45:56] let me run that too
[18:46:21] bblack: T149006
[18:46:22] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006
[18:46:24] I can't even reach elastic2020.codfw.wmnet with regular ssh
[18:46:24] :D
[18:46:39] the other 3 cache hosts, there's no evidence left in logs of when lldpd died
[18:47:00] I'm going to try just starting it and see what happens
[18:47:32] Mar 23 18:47:10 cp2009 lldpcli[58073]: unknown command from argument 1: `#`
[18:47:35] Mar 23 18:47:10 cp2009 lldpcli[58073]: an error occurred while executing last command
[18:47:38] Mar 23 18:47:10 cp2009 lldpcli[58073]: unknown command from argument 1: `#`
[18:47:41] Mar 23 18:47:10 cp2009 lldpcli[58073]: an error occurred while executing last command
[18:47:48] otherwise seems to stay alive though
[18:47:58] lol
[18:48:19] lldpd.service - LLDP daemon
[18:48:19] Loaded: loaded (/lib/systemd/system/lldpd.service; disabled)
[18:48:19] Active: active (running) since Thu 2017-03-23 18:47:10 UTC; 29s ago
[18:48:26] why is it still "disabled"?
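A per-host version of the checks being run by hand above (is the unit enabled, and is the daemon actually active), the kind of thing that could be fanned out over cumin or salt; this is only a sketch alongside the real one-liners quoted in the log, and the output format is made up:

```python
import subprocess

def unit_state(unit="lldpd"):
    """Return (enabled, active) strings for a systemd unit on the local host."""
    enabled = subprocess.run(
        ["systemctl", "is-enabled", unit],
        capture_output=True, text=True,
    ).stdout.strip()
    active = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True, text=True,
    ).stdout.strip()
    return enabled, active

if __name__ == "__main__":
    enabled, active = unit_state()
    if enabled != "enabled" or active != "active":
        print("LLDPD PROBLEM: enabled=%s active=%s" % (enabled, active))
```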
[18:48:37] systemctl enable lldp
[18:48:46] systemctl enable lldpd
[18:48:51] apparently it was disabled
[18:48:53] root@cp1065:~# systemctl status lldpd
[18:48:53] ● lldpd.service - LLDP daemon Loaded: loaded (/lib/systemd/system/lldpd.service; enabled) Active: active (running) since Mon 2017-02-06 16:38:05 UTC; 1 months 14 days ago
[18:48:56] not sure if it can be disabled automatically on failure
[18:48:59] it's enabled on a normal host
[18:49:03] yes
[18:49:16] for all the others, the check I've run was systemctl is-enabled lldpd
[18:49:39] it's not managed directly in puppet, either
[18:49:45] it's installed at installation-time and left alone
[18:50:00] modules/base/manifests/standard_packages.pp: 'lldpd',
[18:50:02] modules/install_server/files/autoinstall/scripts/late_command.sh:# lldpd: announce the machine on the network
[18:50:03] modules/install_server/files/autoinstall/scripts/late_command.sh:apt-install openssh-server puppet lldpd
[18:50:30] we have it also in base::standard_packages
[18:50:44] oh yeah you already mentioned it
[18:51:23] oh well, enabled+started on the 3x caches
[18:51:45] eheheh
[18:58:51] bblack, volans: thanks for loooking into this!
[18:59:28] :)
[19:04:54] bblack: back to the LDF issue, hashing on client IP behind varnish will not help, unless we also disable caching...
[19:07:35] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 3 others: LDF endpoint ordering is not stable between servers when paging - https://phabricator.wikimedia.org/T159574#3126038 (10Gehel) Varnish patch deployed. I'll keep an eye on logs to make sure all request are routed as we expect. We still need to f...
[19:08:31] gehel: good point :) that would've applied to using varnish direct-definition to hash to the backends too heh
[19:09:33] the sane way should be to have wdqs serve consistent content, but that's not easy to do either...
[19:09:53] I'll open a phab task and hope we have a good idea at some point...
[19:10:01] unless you do something crazier, like have varnish extract the parity of the IP address as a backend-chooser (between two backends) at request-arrival time, and both choose the backend based on that and vary the cache on it
[19:10:15] but we're probably not going down that road in varnish-land, it's kinda messy
[19:10:58] there is a 1:1 mapping between nginx and wdqs (nginx only connects to the local wdqs), so that might work
[19:11:45] the bottom-line way to think of the constraints is that applayer services need to function correctly for clients given a randomizing LVS in front of their instances.
[19:11:54] if they're giving inconsistent results for the same query, fail, basically.
[19:12:22] yeah, but in this case that constraint seems fairly hard to implement...
[19:12:35] even in the case where we do legitimate client-ip hashing at the LVS edge, in front of the caches, it's understood this is an optimization and we're not relying on it for correctness
[19:12:57] it's only hard to implement because you've decided up-front that the service is stateless, and it turns out it's not stateless
[19:14:53] agreed, it is a design failure, mainly of code that isn't ours and that we don't control well enough (which is another failure)
[19:15:50] upgrading our lvs/pybal setup to support a true active/passive arrangement within a DC is an option, too
[19:16:28] although I don't know that we'd do automated failover in that case (probably too much false-positive). you'd still have to take a step to fail it over, but maybe it's a confctl step instead of a commit to puppeted varnish or lvs config
[19:17:37] still, it feels wrong to be doing new work in the direction of better support for things that aren't active/active even locally in a DC, when the overall trend we're aiming for is everything active/active even at the geographic scale (multi-dc)
[19:18:02] and that makes the service unscalable...
[19:18:15] which might not be a short term issue, but also feels wrong
[19:18:17] no, the service was unscalable to begin with
[19:18:55] remove the entire traffic layer and imagine you're just exposing wdqs directly to the wild via LVS. How do you handle this case then?
[19:19:24] you could ask LVS to chash on client-IP, but again it's just an optimization. A blip in a healthcheck failure could send clients to the "wrong" side temporarily and break user-facing functionality.
[19:20:02] hashing on client IP is not a solution for application-internal problems with lack of state management or interface inconsistency
[19:20:07] my bad, not "make" unscalable, but "does not address" the current scalability issue
[19:20:46] one more reason to not put a solution in place at the traffic layer
[19:21:05] yes
[19:21:14] the last thing we want is more one-off hacks there to support quirks of specific apps
[19:21:23] (I'm looking at you, MediaWiki)
[19:21:48] so yes, I completely agree that this has to be fixed at the application layer, I just have no idea how to do it efficiently :)
[19:22:53] if we can't solve this problem even for 2x wdqs hosts in eqiad, what's the plan to solve it for active/active wdqs hosts serving traffic in parallel in eqiad + codfw?
[19:23:21] 3x (but yes, that's not the point)
[19:23:25] users can/will switch between the two DCs on occasion, due to dns routing anomalies or traffic re-routes, etc
[19:23:40] and I have no idea yet...
[19:23:43] it's just an optimization that most of the time, most users will stick to one DC
[19:29:39] so rewinding a bit, the basic issue is you have a paginated series of queries, like "search?page=[12345]", and the underlying sort order that determines which entries are on which page-number is inconsistent between the service nodes
[19:29:53] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 3 others: LDF endpoint ordering is not stable between servers when paging - https://phabricator.wikimedia.org/T159574#3071930 (10Gehel) This is done. Longer term solution is tracked on T161240.
[19:30:07] right?
[19:30:10] right
[19:31:10] so "The iteration order is just following the underlying index, which might be different on each node."
[19:31:24] right again
[19:31:31] this "index" is something dynamically constructed in-memory in the wdqs applayer code?
[19:31:48] nope, on disk
[19:32:11] but dynamically constructed when data is imported from wikidata to wdqs
[19:32:13] but it doesn't build that index until it first gets a paginated query for a given predicate?
[19:32:21] or does it build them all up-front?
[19:32:53] I don't have the full understanding of all that, so part of it is guesswork.
[19:32:58] yeah me either
[19:33:06] Those indexes are built up front at insert time.
[19:33:16] ok
[19:33:40] All this is based on Blazegraph, which is a graph-oriented database, and uses some of the same constructs you can expect from a relational DB
[19:33:49] couldn't we have consistent sorting done at insertion time then? instead of "fetch all blah in random order and insert them in this index", "fetch all sorted blah and insert them in this index"
instead of "fetch all blah in random order and insert them in this index", "fetch all sorted blah and insert them in this index" [19:34:26] index in this case is probably a BTree or something similar, which does not have an intrinsic order [19:35:10] well, when/where does the ordering/sorting actually happen? it must happen somewhere, because the paginated output has an ordering (which is consistent if you stick to one node) [19:35:59] there is no explicit ordering. There is an iteration order, which is consistent for a given index (provided the index does not change) [19:36:04] (and btree nodes can be consistently linearized I think, if you walk them with a consistent pattern like left-first depth-first or whatever) [19:36:48] they can be consistently linearized (which is the case: we get consistent response from the same server) [19:38:08] is it that the indexes are built (into the wdqs data) by consistent linearization at different points in time as the graph data is changing out from underneath you? [19:38:16] but there is no garantee that all nodes have the exact same BTree (or whatever it is) as its internal state depends on the insert order (so if we reinstall a server at some point, or if one update is missed, or order of insert is not garanteed, or any of the usual reasons) [19:38:56] not sure I understood the question... [19:39:03] me either! :) [19:39:16] good :) [19:39:34] I guess my horrible mental approximation I have right now is this: [19:39:48] pagination of changing data is always an issue (unless you can snapshot that data - which means you have non changing data) [19:40:11] [wdqs applayer code pulls stuff from local-db] -> [local-db is a graph database that's built by querying wikidata] -> [wikidata] [19:41:00] there is mostly no applayer code, the service exposed is the local-db itself [19:41:15] and yeah that's true even for some simple CRUD application with SQL, right? [19:41:27] (that pagination usually gets borked by new inserts) [19:41:51] just maybe not "completely-re-randomzied" sort of borked [19:41:57] there is a pure JS UI, but the interesting use cases at this point are bots or scripts using directly blazegraph [19:42:01] s/zi/iz/ [19:42:43] * gehel was mentioning the pagination of changing data just to make sure we get this out of the discussion [19:43:19] well it's a fair point, how would you handle this in a simple CRUD app with a simple local DB and multiple users inserting into data that other users are querying with paginated queries? [19:44:26] in many cases it's just not a practical issue. update rate is lower than read-rate, and the shifting sorted index you're paginating by doesn't move around much with each insert [19:44:38] if the insert rate is low enough (which is often the case) you ignore the problem as an edge case [19:45:04] sorry, time to got do some cooking and some eating... I might be back a bit later [19:45:10] ok :) [19:45:23] thanks for the discussion! Interesting as always! We should continue it... [19:46:14] I doubt the pagination of moving data problem is a real issue on WDQS. At least much much less than the randomize all pagination! 
[19:55:19] ah yeah, I was only thinking it might be the reason for this, indirectly (that blazegraph was getting data inserted in a consistent order from wikidata at a single point in time, but the reason two servers do not agree is that they update from wikidata at different points in time as lots of changes happen to the tree)
[20:01:23] 10Traffic, 06Discovery, 06Operations, 10Wikidata, and 3 others: LDF endpoint ordering is not stable between servers when paging - https://phabricator.wikimedia.org/T159574#3126145 (10Smalyshev) 05Open>03Resolved
[20:37:24] 10Traffic, 06Operations, 06Performance-Team, 06Reading-Web-Backlog, and 3 others: Performance review #2 of Hovercards (Popups extension) - https://phabricator.wikimedia.org/T70861#3126365 (10Nirzar) @phuedx is our total delay more than 1000s? I thought it was around 750? @Peter >0.1 second: Limit for us...
[21:28:11] 10Traffic, 10DNS, 06Discovery, 06Labs, and 2 others: multi-component wmflabs.org subdomains doesn't work under simple wildcard TLS cert - https://phabricator.wikimedia.org/T161256#3126688 (10grin)
[23:04:34] 10Domains, 10Traffic, 06Operations, 06WMF-Legal, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3127084 (10Dzahn) 05Open>03Resolved Great, thanks for confirming it works. Resolving this ticket now.