[08:26:44] morning
[08:27:02] yey.. AES128-SHA usage dropped to 0.083% :D
[08:41:07] morning!
[08:58:20] Traffic, Operations, ops-eqiad, Patch-For-Review: rack/setup/install lvs101[3-6] - https://phabricator.wikimedia.org/T184293#4203893 (Vgutierrez)
[09:12:49] ema: what's the proper way of escaping a dot (.) in a VSL query? I'm trying -q 'ReqHeader:X-Client-IP ~ "^10\."' but I still get requests from 102.*
[09:14:44] vgutierrez: how about 'ReqHeader:X-Client-IP ~ "^10\\."' ?
[09:15:19] duh, right, thjx
[09:15:22] *thx
[09:15:39] yw!
[09:15:51] hmmm toolforge traffic will always hit eqiad cache layer, right?
[09:17:34] tools.wmflabs.org? No, that should not go through the caches
[09:18:49] hmm I meant cp* varnish instances
[09:19:01] hmmm
[09:19:12] workers on tools.wmflabs.org hitting wikipedia
[09:19:27] will go through cpXXXX instances on eqiad
[09:20:18] oh, I see, I thought you were wondering whether toolforge itself is behind the caches :)
[09:21:00] I'm trying to narrow down several "unknown" bots running on toolforge: T194380
[09:21:01] T194380: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380
[09:21:47] so it's not clear (yet) how I can get historic data from stuff running @ toolforge
[09:22:06] but realtime it's pretty easy: https://tools.wmflabs.org/admin/oge/status
[09:24:47] so one thing you can do is look at pivot
[09:25:02] https://bit.ly/2jVZ2W1
[09:27:23] that's filtering for UA:DotNetWikiBot and splitting by IP
[09:48:43] Traffic, Operations: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4204037 (Vgutierrez)
[09:57:45] ema: o/ - as FYI we are evaluating https://github.com/allegro/turnilo, that is an open-source fork of Pivot
[09:58:06] nice!
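A note on the escaping exchange at 09:12-09:15: the sketch below is plain Python, not VSL, and only illustrates why `^10.` also matches 102.* while `^10\.` does not. It assumes, consistent with the `\\.` fix that worked, that the quoted string in the query is unescaped once before it is compiled as a regex. `unescape_once` is a hypothetical stand-in for that step, and apart from 10.68.16.126 the sample IPs are made up.

```python
import re

def unescape_once(quoted: str) -> str:
    """Hypothetical stand-in for one layer of string unescaping:
    a backslash followed by any character yields that character."""
    return re.sub(r"\\(.)", r"\1", quoted)

def matching_ips(query_pattern: str, ips):
    """Return the IPs matched by the regex left after one unescape pass."""
    regex = re.compile(unescape_once(query_pattern))
    return [ip for ip in ips if regex.search(ip)]

ips = ["10.68.16.126", "102.3.4.5", "101.0.0.1"]

# One backslash: the escape is consumed, leaving "^10." (dot = any character).
single = matching_ips(r"^10\.", ips)
# Two backslashes: one survives, leaving "^10\." (dot = literal dot).
double = matching_ips(r"^10\\.", ips)

print(single)
print(double)
```

With one backslash all three sample addresses match; with two, only the 10.x address does.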
[09:58:42] hey they use "Wikipedia Edits" in the sample screencast :)
[09:58:56] as we upgrade Druid we might lose some compatibility with the actual pivot, so turnilo might become a super good alternative soon
[09:58:59] it is basically the same
[09:59:03] yup.. I was going to ask if the project is related to WMF somehow
[09:59:13] ah yes those guys and their data released to the public :D
[09:59:47] the UI looks exactly the same as druid's
[10:00:04] sorry, pivot's
[10:01:27] it's a fork of swiv's, which is a now-dead fork of pivot's
[10:08:27] Traffic, Operations: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4204062 (Vgutierrez)
[10:38:16] Traffic, Operations: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4204122 (Vgutierrez)
[11:03:44] Traffic, Operations: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4204216 (Vgutierrez)
[12:50:11] Traffic, Operations: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4204462 (Vgutierrez)
[12:50:31] paravoid: ah I thought it was pivot's, there is a level of indirection then :D
[12:51:59] bblack: do you know any good resource regarding .NET TLS capabilities?
[12:53:08] https://docs.microsoft.com/en-us/dotnet/framework/network-programming/tls#the-schusestrongcrypto-flag --> this is a good one BTW
[12:53:56] in general .NET are separated by several degrees of connection, it doesn't really inhabit the neighborhoods I wander :)
[12:54:00] err
[12:54:00] need to figure out if DotNetWikiBot/3.15 based bots are using AES128-SHA because of an old runtime on toolforge, or for any other reason (misconfiguration / bad practices on the code..)
[12:54:05] ".NET and I are separated"
[12:54:46] to some degree that's on the developers of the bot code. It might be more-fruitful to find who authored/authors DotNetWikiBot rather than the N users deploying variants/configs of it.
[12:54:51] they might have a better idea about these things
[12:55:34] so far I've identified two tools behind 5 different UAs running in toolforge
[12:55:56] apparently runtime isn't unified between tools-exec instances
[12:56:05] and that's why we get different UAs
[12:56:12] /o\
[13:03:17] looking at the build deps of the mono source package in trusty it doesn't build depend on any system crypto library, so it seems to provide its custom crypto
[13:05:52] Traffic, Operations: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4204514 (Vgutierrez)
[13:08:27] Traffic, Operations: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4204518 (MaxBioHazard) Do we need to stop using DotNetWikiBot framework, because you will disable encryption method, used in it?
[13:12:29] yey.. already getting feedback from the users :)
[13:15:55] Traffic, Operations: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4204523 (BBlack) It's more likely that DotNetWikiBot just needs to be built against a newer .NET version, or needs .NET configuration tweaks, to support better encryption (or p...
[13:57:13] https://ripe76.ripe.net/presentations/10-2018-05-15-bbr.pdf
[13:57:18] ema: ^
[14:06:14] bblack: nice one, yeah, I've skimmed through the presentation earlier on today
[14:06:36] the interaction of cubic and bbr is particularly impressive
[14:17:03] Traffic, Operations: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4204704 (MaxBioHazard) E-mail of DNWB author is codedriller@gmail.com . You can write a letter to he and explain, what he should do to fix this problem.
[14:20:08] moritzm: I've tested mono 4.8.0 with DotNetWikiBot/3.15 on a docker container (FROM mono:4.8.0), and it's able to speak TLSv1.2 with a decent ciphersuite
[14:20:33] decent --> ECDHE-ECDSA-CHACHA20-POLY1305
[14:21:01] yeah, but trusty has 3.2.8 :-)
[14:23:12] not every tools-exec worker is running trusty
[14:23:40] I'm seeing Mono 4.8.0 from tools-exec IPs
[14:23:53] DotNetWikiBot/3.15 (Unix 3.13.0.141; Mono 4.8.0; .NET CLR 4.0.30319.42000). Activity detected: 10.68.16.126 (tools-exec-1441) this one for example
[14:24:48] BTW, DotNetWikiBot doesn't provide any way of complying with our User Agent recommendations for bots :(
[14:34:08] so if I use an older mono environment... like mono:3.10 I get the following UA: DotNetWikiBot/3.15 (Unix 4.9.87.0; Mono 3.10.0; .NET CLR 4.0.30319.17020) and bad TLS --> AES128-SHA
[14:34:40] the Unix 4.9.87.0 part of the UA is basically the kernel I'm running on the docker VM
[14:35:37] and DotNetWikiBot gets it from Environment.OSVersion.VersionString - https://msdn.microsoft.com/en-us/library/system.operatingsystem.versionstring(v=vs.110).aspx
[14:35:51] migrating from trusty->(jessie? stretch?) I'm sure is on toollabs radar somewhere, but they may have some set of deprecation dates further off in the future than we'd like for trusty (and thus old mono) support.
[14:36:24] but the fact that there are existing DNWB bots that seem to be running on Mono 4.8.0 at least inspires some hope that the others just need to shift to different nodes and work as expected.
[14:36:45] bblack: hmmm if I'm reading the UA correctly, they have an updated mono compiled against old/legacy crypto libraries
[14:39:07] at least for the bots using mono 4.8.0
[14:39:47] the ones running with mono 3.2.8 maybe need an update
[14:40:03] I wonder how the tool maintainer picks the .NET runtime
[14:41:28] I have no idea. It seems like from DNWB's sourceforge page, they intend for it to run on windows or mono, from a source development standpoint.
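The UA strings quoted above carry the Mono version, which is what separated the good and bad TLS results in these tests. A minimal Python sketch of pulling it out follows; the regex is inferred only from the two UA samples in this log, and the (4, 8, 0) cutoff is simply the lowest version seen negotiating TLSv1.2 here, not a documented boundary. `parse_ua` and `mono_at_least` are hypothetical helper names.

```python
import re

# Shape inferred from the UA samples in this log; other DNWB builds may differ.
UA_RE = re.compile(
    r"^DotNetWikiBot/(?P<bot>[\d.]+) "
    r"\(Unix (?P<kernel>[\d.]+); Mono (?P<mono>[\d.]+); \.NET CLR (?P<clr>[\d.]+)\)"
)

def parse_ua(ua: str):
    """Return the UA fields as a dict, or None if the UA has another shape."""
    m = UA_RE.match(ua)
    if m is None:
        return None
    fields = m.groupdict()
    # Numeric tuple so that 3.10.0 compares as newer than 3.2.8.
    fields["mono_tuple"] = tuple(int(p) for p in fields["mono"].split("."))
    return fields

def mono_at_least(ua: str, minimum=(4, 8, 0)) -> bool:
    """True if the UA reports a Mono version >= minimum (assumed cutoff)."""
    fields = parse_ua(ua)
    return fields is not None and fields["mono_tuple"] >= minimum

good = "DotNetWikiBot/3.15 (Unix 3.13.0.141; Mono 4.8.0; .NET CLR 4.0.30319.42000)"
bad = "DotNetWikiBot/3.15 (Unix 4.9.87.0; Mono 3.10.0; .NET CLR 4.0.30319.17020)"

print(parse_ua(good)["mono"], mono_at_least(good))
print(parse_ua(bad)["mono"], mono_at_least(bad))
```

Comparing version tuples rather than strings matters here, since "3.10.0" sorts before "3.2.8" as text.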
[14:41:46] I'm not sure how the API/ABI stuff works out
[14:43:32] oh.. they already provide built .DLLs..
[14:43:45] let me try those on an updated mono environment..
[14:43:53] I was compiling my own
[14:45:21] * ema waves at vgutierrez while he can still see him down that rabbit hole
[14:45:49] hmmm using the provided dll for mono (DotNetWikiBot.Build.for.Mono.dll) on a recent mono version (4.8.0) produces a decent TLS connection
[14:45:52] ema: hahaah
[14:46:05] ema: now that I've coded some C#?
[14:46:11] I'm lost
[14:46:39] so I'll try to reach our lovely cloud team
[14:59:40] so we've two scenarios
[14:59:58] 1 user using the default toolforge mono version and one that somehow uploaded its own mono VM
[15:02:59] Traffic, Operations: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4204891 (Vgutierrez) @MaxBioHazard I just tested DotNetWikiBot/3.15 on a docker container with mono 4.8.0 and it's able to use recent TLS ciphersuites, so you should be able to...
[15:14:00] vgutierrez: we also, unrelatedly, have someone running their own custom TCL interpreter...if you ever find yourself having to care about that particular thing just run away. :)
[15:15:29] chasemp: nah.. I don't care about their custom mono VM if the one that we provide is able to speak proper TLS :)
[15:15:48] agreed
[15:16:38] Traffic, Operations: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4196798 (Reedy) AutoWikiBrowser isn't showing in your requests... And that is .NET based, though mostly on Windows, rather than mono (though, there are some users that do..) D...
[15:17:15] I guess it's only fair to provide them with a solution if we are hardening the TLS requirements :)
[15:22:20] Traffic, Operations: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4204922 (Vgutierrez) >>! In T194380#4204918, @Reedy wrote: > AutoWikiBrowser isn't showing in your requests... And that is .NET based, though mostly on Windows, rather than mon...
[15:24:58] Traffic, Operations: Identify bots using AES128-SHA maintainers running on toolforge - https://phabricator.wikimedia.org/T194380#4204926 (Reedy) >>! In T194380#4204922, @Vgutierrez wrote: >>>! In T194380#4204918, @Reedy wrote: >> AutoWikiBrowser isn't showing in your requests... And that is .NET based, t...
[15:40:37] chasemp: I don't know if it would be crazy to import the latest stable mono release from mono-project trusty repo to labs apt repo
[15:41:38] chasemp: I've just tested this: https://phabricator.wikimedia.org/P7125
[15:42:21] the result is "DotNetWikiBot/3.15 (Unix 4.9.87.0; Mono 5.12.0.226; .NET CLR 4.0.30319.42000)" speaking proper TLS1.2 with ECDHE-ECDSA-CHACHA20-POLY1305 as a ciphersuite
[15:42:31] with a stock DotNetWikiBot 3.15
[15:43:20] vgutierrez: should be fine, and as long as it works for our known existing case makes sense to me. Maybe easiest thing to do is add it as an upstream in reprepro
[15:43:28] nothing else should be affected by that
[15:44:22] chasemp: could cloud take care of this? should I open a task in phabricator?
[15:46:03] please do, I'll bring it up in the meeting tomorrow, I'm not sure what the timeline would be
[15:46:34] we are back2back2back big maintenance for switch moving and kernel updates, travel, and conferences
[15:47:16] chasemp: any special tag? or it would be enough with subscribing you to the task?
[15:47:36] Toolforge works and cloud-services-team
[15:47:41] sure
[15:48:05] thx :D
[15:56:27] chasemp: https://phabricator.wikimedia.org/T194665 there you go :)
[15:56:51] let me know if you need something else from our side
[16:05:07] kk
[16:30:55] netops, Operations, ops-eqiad, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4205104 (Joe) The following servers: ``` mc1012 mc1011 mc1010 mc1009 mc1008 mc1007 ``` should all be decommissioned by now, and definitely don't need any s...
[16:31:17] netops, Operations, ops-eqiad, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4205105 (Marostegui) From a DB point of view, these servers need special care: db1061 - s6 primary master. We'd need the less downtime possible. Writes to f...
[16:37:45] netops, Operations, ops-eqiad, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4205127 (jcrespo) Also most db hosts will need to be depooled (but that can be done for an extended time) due to Mediawiki bugs with timed out requests: T180...
[16:56:00] netops, Operations, ops-eqiad, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4205217 (akosiaris) ganeti hosts are the housing for multiple VMs. Those will experience an outage during the recabling. Listing them here ``` aluminium.wi...
[17:00:18] netops, Operations, ops-eqiad, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#3991415 (Volans) Yeah, `puppetdb1001` will probably just generate some spam on IRC for failing puppet runs, transient. Regarding `neodymium` the only thing...
[18:06:32] netops, Operations, ops-eqiad, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4205406 (jcrespo) We should be able to failover logically dbproxy1007,8,9 to its hot spare, too.
[18:13:18] netops, Operations, ops-eqiad, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4205417 (jcrespo) With the only above patches, the only special requirement for us is to handle db1061 (s6 master) on its own separate window- provide a real...
[18:18:28] bblack: Do we blacklist anything on our auth DNS? cf. https://phabricator.wikimedia.org/T180277#4203089 I don't think so, but want to be 100% sure
[18:19:53] I see that the IPs are blacklisted in http://dnsbl.spfbl.net/ but that looks unrelated
[19:18:14] XioNoX: we don't blacklist anything for access to DNS lookups, no. It's possible one of our EU transits is blackholing some of their traffic though, if they're on some other blacklists?
[19:18:47] XioNoX: or I guess, it's possible we have some outdated anti-dos rule somewhere on one or more routers blocking that network (for the UDP DNS traffic but not ping?)
[19:19:30] XioNoX: (it's not clear to me from those ticket replies though, whether it's only ns2 in esams they can't talk to, or all of them)
[19:19:37] I don't see anything on the routers that could block it
[19:19:50] Yeah, that's one of my follow up questions
[19:20:25] It also would be bad if our transit were blocking traffic without us knowing
[19:21:54] XioNoX: you should try pinging both of the different IPs/network examples they gave us, from baham :)
[19:22:15] (or radon, or anywhere, I guess)
[19:22:17] Done
[19:22:27] TTL exceeded for all :)
[19:22:45] yeah
[19:22:51] I was gonna paste, then I realized IPs and this channel
[19:22:58] must still be looping on their end?
[19:23:21] Their network is funky, that doesn't help with tests from us to them
[19:23:45] this seems sort of like the whole "omg my code doesn't work, it must be an Intel CPU bug!" sort of case
[19:24:19] "my one random corporate network has problems accessing Wikipedia, clearly there must be a strange fault at Wikipedia, which is specific to our network, that they need to investigate"
[19:24:26] Haha yeah
[19:28:50] anyways, if I had to hazard a guess, I'd say even if their queries are making it to ns[012].wm.o, the replies are looping in the edge of their network somewhere and never making it all the way back
[20:22:28] Traffic, Analytics-Cluster, Analytics-Kanban, Operations, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4205765 (Ottomata) @bblack did this end up being a Q4 goal for traffic team?
[20:25:36] Traffic, Analytics-Cluster, Analytics-Kanban, Operations, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4205770 (BBlack) I think this ended up being an Analytics Q4 goal? It's not on our goals list, but we agree to alot some time to it in this Q...
[20:40:54] Traffic, Analytics-Cluster, Analytics-Kanban, Operations, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4205810 (Ottomata) Ok, great! From our side, we're mostly looking on either more TODOs and/or approval to remove IPSec from jumbo + varnishka...
[21:01:31] bblack: something I've been meaning to ask you and I'll make a task about it but food for thought. Fighting some mailman issues recently where we resorted to blocking IPs of known bad actors, and then we have a similar mechanism codified for Phabricator itself, and within Mediawiki. All at the application layer. I have also resorted long ago to blocks at the edge router layer which is yet another place (YAP).
[21:01:31] Thinking on where would be a good place to consolidate infrastructure based IP blocking. Wanted to get your early take on it. I'm assuming it would be a similar component but per site and so 'as code' would be needed to have a sane single point of truth. This got long but anyhow, not even now just sometime in the next few weeks I hope to get some time to talk.
[21:09:40] chasemp: it's kind of a complicated topic, I don't think I can do it justice. I do agree we should have some more centralized/aligned thinking about the whole thing from these different perspectives (although not necessarily one shared solution for all purposes)
[21:10:03] chasemp: maybe a good sub-topic for during SRE offsite? w/ moritzm + arzhel as well.
[21:10:54] yup
[21:11:25] (I have ideas and previous experience, at least for the network part)
[21:13:00] see https://github.com/mozilla/BanHammer
[21:15:02] Cool, summit topic seems great, and I'll try to bring a few seed ideas too that are thought about enough to kick around at least
[21:17:56] (and agreed I don't know that we want to boil it all down to one thing but the overlap now is large, undefined, and only getting worse)
[21:18:04] tx XioNoX that looks cool
[21:19:42] it could be one place storing files with a list of ip addresses / cidr to ban
[21:19:51] which are then read by various services
[21:19:57] some pieces of the puzzle to keep in mind, thinking ahead for that discussion:
[21:21:24] 1) There's some different domains of administrative control here, the edge-cases being Wiki-level admins (e.g. enwp privileged users) looking at Wiki-level abuse at one end of the spectrum, and router-config-level blocking at the other (not even public repo or visible, and few can/should touch it).
[21:22:20] 2) It's not necessarily true that some "abuse" network we want to block for service X should be blocked from service Y, but sometimes it is.
[21:23:16] 3) Networks banned for abuse from various services at various levels need maintenance. Sticking them in a list to bitrot isn't great. eventually the abuse stops and the blockage is just left harming some legit use-cases, or the network changes hands, or especially for singular IPs was always somewhat dynamic, etc...
[21:23:22] I was envisioning text files stored in puppet having block lines and an optional list of services
[21:23:37] 3.5) (timeouts after which a blockage must be re-evaluated or reverted?)
[21:23:57] comments explaining why it is done, too
[21:24:45] 4) Anything that considers individual IPs or networks from singular abusive incidents should be distinct from our solutions for widely-distributed attacks (volumetric or otherwise), for which we'd probably use completely different and less-manual solutions XioNoX is looking into.
[21:25:37] in some cases blocks should be complete, while in others read-only
[21:26:10] bblack: thanks, good listing -- I am with you up to 4 where I'm wondering if they can be the same mechanism but separate control plane
[21:26:24] but I have no conclusions, only questions atm
[21:26:27] bblack: timeout is what we did with banhammer
[21:26:55] same mechanism for network layer blocks that is, anyhoo, I'll take this and turn it into a task
[21:27:03] also depends on how much effort we want/can put into it :)
[21:27:11] so true
[21:28:11] we can imagine a webapp, that lets you do different things depending on your level, as well as expose different information depending on how confidential it is
[21:28:42] maybe I'll add a 5!
[21:29:14] or a google spreadsheet pulled by a cron that updates all the necessary systems :)
[21:29:39] I think the first thing I'll do is formulate some problem statements which will hopefully break down some of the layers
[21:29:52] all the use cases
[21:30:21] 5) In both the widely-distributed (volumetric or deeper/cheaper attacks) and not-so-widely-distributed cases, sometimes it's not even IPs we want to block on, but things like unique traffic attributes at the IP layer (funny header bits, ports, etc), or at the other extreme UA strings or other http request header patterns where that's applicable. Some of these other non-IP-things we might block on run into a lot of the same issues/layering/maintenance/etc as IP-based blocking.
[21:31:13] (e.g. currently we sometimes look at headers and/or UA strings in generic VCL to block for all traffic-terminated services from certain abusive scenarios)
[21:31:19] * chasemp nods
[21:32:02] and the IP layer bits aside from actual IP addresses/networks, is something we've talked about more in the context of volumetric DDoS, but could apply in other cases as well to help single out deeper/cheaper abuses.
[21:33:10] * XioNoX puts it on the roadmap for the v2
[21:33:59] It's all intertwined for sure, and I was originally thinking in both DDoS and one-off terms but have lots of input needed on pushing codified and extensible banning to the edges for application layer things
[21:35:02] maybe it will be easier to educate people to not attack us?
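The "text files with block lines, an optional list of services, a timeout, and a comment explaining why" idea from the discussion above could look roughly like the Python sketch below. Everything about the file format is an assumption made up for illustration (column order, ISO expiry date, comma-separated services, `#` for the reason), and the networks are documentation ranges, not real blocks. It is not how BanHammer or any existing WMF tooling works.

```python
import ipaddress
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, List, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]

@dataclass(frozen=True)
class Block:
    network: Network
    expires: date             # last day the block applies; re-evaluate after
    services: FrozenSet[str]  # empty set means "all services"
    reason: str

def parse_blocklist(text: str) -> List[Block]:
    """Parse lines of the assumed form: <cidr> <expiry> [svc,svc] # reason."""
    blocks = []
    for raw in text.splitlines():
        entry, _, reason = raw.partition("#")
        fields = entry.split()
        if not fields:
            continue  # blank or comment-only line
        if len(fields) < 2:
            raise ValueError(f"missing expiry date: {raw!r}")
        services = frozenset(fields[2].split(",")) if len(fields) > 2 else frozenset()
        blocks.append(Block(
            network=ipaddress.ip_network(fields[0]),
            expires=date.fromisoformat(fields[1]),
            services=services,
            reason=reason.strip(),
        ))
    return blocks

def is_blocked(blocks: List[Block], ip: str, service: str, today: date) -> bool:
    """True if an unexpired block covering this service contains the IP."""
    addr = ipaddress.ip_address(ip)
    for block in blocks:
        if today > block.expires:
            continue  # timed out: must be re-evaluated before it applies again
        if block.services and service not in block.services:
            continue  # block is scoped to other services
        if addr in block.network:
            return True
    return False

SAMPLE = """\
# hypothetical entries using documentation address ranges
192.0.2.0/24    2018-06-30  mailman,phabricator  # list spam
198.51.100.7/32 2018-05-20                       # abusive scraper
2001:db8::/32   2018-12-31  mediawiki            # example v6 block
"""

blocks = parse_blocklist(SAMPLE)
today = date(2018, 5, 15)
print(is_blocked(blocks, "192.0.2.10", "mailman", today))
print(is_blocked(blocks, "192.0.2.10", "mediawiki", today))
```

Making the expiry a mandatory column is one way to force the re-evaluation raised in 3.5); the per-service column covers point 2). The complete versus read-only distinction and the non-IP attributes from point 5) are not modelled here.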
[21:35:03] (by deeper/cheaper widely-distributed abuses in places above, what I mean is: we might see an attack from millions of widely-dispersed IPs, but it's not volumetric enough to saturate links or anything so it's not really the usual DDoS scenario, but is doing something else nasty like scanning passwords or trying to exploit bugs, or hitting expensive API calls)
[21:35:20] I'm hesitant to try to solve it all at once bc of scope and resourcing but also it's so interrelated it'll be necessary to lay it all out to really think out what's the most useful
[21:35:53] ack there
[21:36:44] I tend to think a good tradeoff for something like this is to try to cover the complete scope shallowly, and then narrow in on the parts we can realistically attack in the near-to-medium term for deeper discussion.
[21:37:09] which is maybe just a rewording of what you just said heh :)
[21:37:11] fair, that's also probably the most potential for iteration and shared lessons between means
[21:40:54] And an open source tool that other companies might be interested in
[21:50:40] netops, Operations, ops-eqiad, Patch-For-Review: Rack/cable/configure asw2-c-eqiad switch stack - https://phabricator.wikimedia.org/T187962#4205997 (ayounsi)
[22:55:22] do you agree that we should send an SPF record specifically saying "no email, -all" for all the parked domains? adding that to the "parking" template which affects them all at once: https://gerrit.wikimedia.org/r/#/c/429874/
[22:57:59] mutante: I'm just not 100% sure none of them are used for sending mail in some wikis' configurations (not mail sent by us, but mail sent directly by users that's probably already getting greylisted, but this would make it look worse, and then we do forward back to the user on sends to those addresses?)
[22:58:30] I know we do that for e.g. @wikipedia.org and such, I just don't know whether it extends into some of what we consider the noncanonical domains over in the HTTP world.
[23:00:48] bblack: i thought about that and it kind of made me think "well, just do the parked domains for now, others might need more checks"
[23:01:54] but that one template covers a lot at once that Keith/Valentin listed on https://wikitech.wikimedia.org/wiki/Domains
[23:03:04] i could grep through some mail server logs for all the parked domain names i guess
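For reference, the "no email" SPF policy discussed above is a TXT record whose value is `v=spf1 -all`. The Python sketch below renders it in generic zone-file syntax and does a rough version of the log check floated in the last message. The domain names, TTL and log lines are placeholders, and the real "parking" template in the gerrit change may look nothing like this.

```python
def null_spf_record(domain: str, ttl: int = 3600) -> str:
    """Render a generic zone-file TXT line saying the domain sends no mail."""
    fqdn = domain.rstrip(".") + "."
    return f'{fqdn} {ttl} IN TXT "v=spf1 -all"'

def publishes_null_spf(txt_value: str) -> bool:
    """True if a TXT value is exactly the 'no mail at all' SPF policy."""
    return txt_value.strip().strip('"').split() == ["v=spf1", "-all"]

def domains_seen(log_lines, parked):
    """Rough check: which parked domains show up as @domain in mail logs.
    Plain substring match, so it can over-report (e.g. @example.organic)."""
    seen = set()
    for line in log_lines:
        lowered = line.lower()
        for domain in parked:
            if "@" + domain.lower() in lowered:
                seen.add(domain)
    return sorted(seen)

# Placeholder data, not real parked domains or real log lines.
parked = ["example.org", "example.net"]
log_lines = [
    "2018-05-15 22:10:01 <= someone@example.net H=mail.invalid",
    "2018-05-15 22:11:42 => user@example.com",
]

for domain in parked:
    print(null_spf_record(domain))
print(domains_seen(log_lines, parked))
```

Any parked domain that turns up in `domains_seen` is a candidate for the "others might need more checks" pile before `-all` is published for it.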