[08:18:10] Traffic, Analytics, Analytics-Data-Quality, Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#3961955 (elukey) Nothing varnish-related happened on Feb 6th as far as I can see from the ops SAL: https://tools.wmflabs.org/sal/production?p=...
[08:54:07] elukey: hi!
[08:54:46] so is it the 10th of Feb or the 6th the day when opera mini things started changing?
[08:59:29] ema: hello! Yeah the first change seems on the 6th, but Nuria made a very relevant comment in https://phabricator.wikimedia.org/T187014#4102802
[09:27:18] Traffic, Analytics, Analytics-Data-Quality, Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4111348 (ema) >>! In T187014#4110582, @Nuria wrote: > Varnish5 rollout might have something to do with this? https://gerrit.wikimedia.org/r/#/...
[10:14:28] Traffic, Operations, TemplateStyles, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4111464 (Deskana)
[11:24:28] Traffic, Operations, TemplateStyles, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4111648 (Nirmos) Not sure I understand your question. Are you asking how to fix the lint errors so that wikis can switch from Tidy to R...
[11:26:40] Traffic, Operations, TemplateStyles, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4111651 (Elitre) >>! In T133410#4111612, @Zoranzoki21 wrote: > Hi, > happy holidays! I have question. How to we fix tags so migration c...
[11:27:17] interesting, I was checking for discrepancies between the left-most X-Forwarded-For value and the Forwarded header from opera mini on an esams frontend
[11:28:12] except for a bunch of requests without 'Forwarded', all those I looked into seemed fine
[11:28:25] then I've repeated the experiment in eqiad, also all good
[11:28:40] what's interesting there is the geolocalization of the opera mini IPs
[11:28:52] most seem to be from India
[11:29:56] then Ukraine (?)
[11:32:29] out of 552 unique IPs I've collected in eqiad:
[11:32:37] 289 GeoIP Country Edition: IN, India
[11:32:42] 87 GeoIP Country Edition: UA, Ukraine
[11:32:48] 31 GeoIP Country Edition: VE, Venezuela
[11:32:54] 25 GeoIP Country Edition: LK, Sri Lanka
[11:33:01] 21 GeoIP Country Edition: BD, Bangladesh
[11:33:42] 20 GeoIP Country Edition: KG, Kyrgyzstan
[11:33:47] 17 GeoIP Country Edition: AZ, Azerbaijan
[11:34:06] 11 GeoIP Country Edition: IR, Iran, Islamic Republic of
[11:34:14] 10 GeoIP Country Edition: CZ, Czech Republic
[11:34:19] 9 GeoIP Country Edition: BR, Brazil
[11:35:12] those in esams are instead all reasonable (Nigeria, Russia, Kenya, ...)
[11:35:56] (lol @ my "geolocalization" above)
[11:39:37] geol10n
[11:41:35] Traffic, Analytics, Analytics-Data-Quality, Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4111691 (mbaluta) Please note that number of page views prior to 6th February seems incorrect from our perspective too - number of Opera Mini...
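Editor's sketch: the per-country tally above looks like counted `geoiplookup`-style output. A minimal way to produce such a tally, assuming one lookup result line per unique IP in the "GeoIP Country Edition: CC, Name" format seen in the log, could look like this. The function name `tally_countries` and the sample input are hypothetical and are not taken from the tooling actually used in the channel.

```python
from collections import Counter

# Prefix of the lookup output lines as they appear in the log above.
PREFIX = "GeoIP Country Edition: "


def tally_countries(lookup_lines):
    """Count lookup results per country, most common first.

    Lines that do not start with the expected prefix are ignored.
    """
    counts = Counter(
        line.strip()[len(PREFIX):]
        for line in lookup_lines
        if line.strip().startswith(PREFIX)
    )
    return counts.most_common()


if __name__ == "__main__":
    # Illustrative input only, not the 552 IPs collected in eqiad.
    sample = [
        "GeoIP Country Edition: IN, India",
        "GeoIP Country Edition: UA, Ukraine",
        "GeoIP Country Edition: IN, India",
    ]
    for country, count in tally_countries(sample):
        print(count, PREFIX + country)
```

The output format mirrors the count-then-line layout pasted in the channel, which makes it easy to compare one data center's tally against another's.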
[11:42:15] Traffic, Operations, TemplateStyles, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4111693 (Tgr)
[11:44:07] i say geol6n is now a thing
[11:55:36] also, by looking at XFF for those IPs weirdly ending up on an eqiad frontend, the Opera Mini server IPs seem to be georouted properly to ulsfo
[12:10:05] what seems particularly weird is that I'm currently seeing on cp1068's frontend some requests w/ X-Client-IP 107.167.104.168 (one of opera mini's servers)
[12:10:26] but those should end up in ulsfo according to `gdnsd_geoip_test generic-map 107.167.104.168`
[12:19:59] (1) why is X-Client-IP: 107.167.104.168 ending up on an eqiad frontend
[12:22:24] (2) who knows how good the geolocation of opera mini proxies is (all those real-IPs in India/Ukraine being served by eqiad don't seem encouraging)
[12:24:23] I've gotta go afk for a bit
[12:39:22] ema, vgutierrez: in case you're bored on a friday afternoon, i have another 1200 lines of test code for review ;)
[12:39:40] hahaha
[12:40:13] I checked that CR yesterday and I got scared
[12:40:29] i revised it a bit today ;)
[12:40:32] but i'm done now
[12:40:37] ack
[12:40:45] I'll give it some love this afternoon then
[12:40:49] and i just worked on spitting off FSM into its own module
[12:40:53] splitting
[12:40:55] hmmm
[12:41:05] what about getting the FSM from the 2.x branch?
[12:41:14] that's something else entirely
[12:41:44] the 2.x branch has some work on making pybal use an FSM instead of the state keeping booleans
[12:42:02] we could use that on the Server class
[12:42:04] the bgp FSM is an FSM exactly as defined in the BGP RFC
[12:42:10] yeah, in 2.x
[12:42:14] that's the idea
[12:42:42] server.pooled and is_pooled are tricky and lead to mistakes
[12:42:55] i know that, that's why we're working on an FSM implementation
[12:43:08] .is_pooled isn't even used at all btw
[12:43:25] my only reason for that change is to slightly clear up its current usage, until we get to 2.x
[12:44:16] now we have unit testing it's a bit easier
[12:44:34] i'm also trying to add complete statement coverage for unit testing, to migrate to python3
[12:44:43] at ~90% now, so we're getting there
[12:49:12] so I guess your comment with .pooled = was thinking along the lines of an FSM
[12:50:40] i'll push my other two changes I guess
[12:51:37] OK I'm starting to think that something must be wrong with opera mini resolvers
[12:52:04] ema is in The Zone :)
[12:52:27] all the X-Client-IPs from ReqHeader:User-Agent ~ "Opera Mini" that I've seen so far in eqiad should actually be geo-routed to eqsin/ulsfo
[12:52:46] mark: yeah :)
[13:02:04] ema: well who knows what resolvers they are using and how those are geolocated...
[13:06:58] Traffic, Analytics, Analytics-Data-Quality, Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4111935 (ema) >>! In T187014#4111691, @mbaluta wrote: > If you provided IP address of our server, we could at least tell whether it is coming...
[13:09:20] mark: it looks like a specific issue of only some of the opera mini servers. Traffic hitting esams looks good at the moment (X-Client-IP properly sent to esams)
[13:09:35] eqiad is all b0rked instead
[13:10:31] do we know if they actually use edns-client-subnet?
[13:15:19] ha, nice to see them on phab directly
[13:18:29] I'm not sure re:them using edns-client-subnet, I think the answer is "at least partially" given that many of their requests do end up in non-eqiad DCs
[13:18:50] and yes, nice seeing them on phab!
[13:20:00] or they don't and they have clusters including resolvers on IPs geolocated to non-eqiad?
[13:20:31] that's a sad possibility too
[13:20:41] and maybe they have some cluster somewhere without resolvers and then use a resolver from elsewhere - who knows :)
[13:24:12] mmmh all this is probably orthogonal to the issue of skewed country-level numbers for opera mini
[13:24:28] the stats we're looking at are based on the "Country" field in pageviews https://meta.wikimedia.org/wiki/Research:Page_view#Resulting_format
[13:24:37] > A UDF that implements MaxMind geolocation - combined with something to detect valid XFFs and properly handle those, /and/ correctly identify IPs in the case that the request is coming from an SSL terminator or similar
[13:25:21] elukey: do we know what the code doing this ^ looks like?
[13:42:15] ema: I think we have a java specific maxmind client in the analytics refinery, but I am not an expert on it
[13:43:04] also, apparently cloudflare's 1.1.1.1 resolver strips edns-client-subnet in the putative name of privacy
[13:43:16] it's kind of a brilliantly evil move on their part
[13:43:43] what better way to simultaneously increase the accuracy of your CDN network's geodns targeting and also harm that of competitors?
[13:44:35] announce a major public DNS service on a cool address (anyone who uses it, you can optimally geodns route them for your CDN because you see their client IP during the DNS phase directly), and don't support edns-client-subnet in order to de-optimize the situation when they hit non-cloudflare things.
[13:45:14] that's evil(TM)
[13:45:52] yet here 1.1.1.1 resolves enwiki to text-lb.esams
[13:46:15] yeah but that might be lucky exit-point routing
[13:48:00] ?
[13:49:48] even if they've killed ECS, they're likely still trying to route DNS requests efficiently. Meaning if your DNS request arrived at a cloudflare server somewhere in the EU, the follow-on request from cloudflare->wmf is likely to also originate in the EU, and that exit IP might be geolocated in the EU, thus negating the need for ECS in the case of that particular request.
[13:51:00] gotcha, 1.1.1.1 for me is EU-based hence I got geolocated properly
[13:51:23] but then they might have different boundary-conditions than we do, so it wouldn't work well in edgier cases (e.g. on the borderlands between two of our DCs' regions). They might also, in the interest of efficiency and speed, have some kind of async sharing of DNS cache entries that crosses regions and might interfere without ECS support, etc
[13:56:08] Traffic, Analytics, Analytics-Data-Quality, Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4112076 (ema) @mbaluta: note that the problem I've mentioned in my comment above is probably unrelated to the stats issue discussed here (woul...
[13:57:23] https://developers.cloudflare.com/1.1.1.1/nitty-gritty-details/
[13:58:48] and they don't give any info on their exit points, except a link to their general list of cloudflare-owned networks.
[13:59:56] I get the privacy angle they're pushing, it's semi-legitimate. But they could've just adopted reasonable masking policies. e.g. limiting to /20 or even /16 for v4 would give something to go on, and at least generally nail the approximately-correct region.
[14:00:29] (also, from a public global dns cache perspective, ECS is probably very hard to implement efficiently, which is why so few do)
[14:01:43] anyways, given the public details we're given, all we can hope for is that (a) they always forward DNS requests to us from exit IPs that are close-ish to the user and (b) that those IPs are correctly recorded in maxmind's database.
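Editor's sketch: the masking policy suggested at 13:59:56 (limiting the client subnet to /20 or /16 for v4) amounts to zeroing the host bits of the client address before it is forwarded. A minimal illustration with Python's standard `ipaddress` module follows. The function name `mask_client_subnet` is hypothetical, the sketch covers IPv4 only, and it does not reflect how any particular resolver implements edns-client-subnet.

```python
import ipaddress


def mask_client_subnet(client_ip, v4_prefix=20):
    """Return the IPv4 network that would be disclosed for client_ip.

    strict=False lets ip_network() accept an address with host bits set
    and truncate it to the enclosing network of the given prefix length.
    """
    network = ipaddress.ip_network(f"{client_ip}/{v4_prefix}", strict=False)
    return str(network)


if __name__ == "__main__":
    # Address taken from the log above (one of the Opera Mini servers).
    print(mask_client_subnet("107.167.104.168", 20))  # 107.167.96.0/20
    print(mask_client_subnet("107.167.104.168", 16))  # 107.167.0.0/16
```

A /20 still narrows the client down to a block of 4096 addresses, which is usually enough for region-level geolocation while hiding the individual host.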
[14:05:34] checking out some of cloudflare's ranges in geoip2-city, they seem reasonably-legit
[14:05:59] (as in, they're not all saying some corporate address. I see different US states and a few foreign countries when checking various of them)
[14:06:26] including one in singapore
[14:06:40] (so hopefully, that's going to capture to eqsin most of the 1.1.1.1 users in asia)
[14:54:31] Traffic, Operations, Pybal, Patch-For-Review: pybal 1.15.2 dies with obscure errors without python-prometheus-client - https://phabricator.wikimedia.org/T190527#4112273 (Vgutierrez) The issue has been solved on pybal 1.15.3 available for stretch
[14:55:07] Traffic, Operations, Pybal, Patch-For-Review: pybal 1.15.2 dies with obscure errors without python-prometheus-client - https://phabricator.wikimedia.org/T190527#4112274 (Vgutierrez) Open>Resolved a:Vgutierrez
[15:00:49] ema, bblack: about the CP servers in eqiad C8 ( https://racktables.wikimedia.org/index.php?page=rack&rack_id=1963 ) do they need to start all together in the same row or would they benefit from being distributed across the row? (or in different rows as well)
[15:12:12] restart?
[15:13:01] (is this about timing of downtimes/restarts, or about redistributing them while moving the ethernets?)
[15:13:29] in general redistribution probably isn't worth the effort, they're due for replacement Soon
[15:22:43] bblack: it's about the asw->asw2 move, if some servers could be moved to different racks ahead of time that would be a win
[15:22:51] bblack: what's the replacement timeline?
[15:28:32] Traffic, Analytics, Analytics-Data-Quality, Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4112360 (Nuria) @ema: > I think we should try to debug the code that sets Country to "United States" for User-Agent: ~ "Opera Mini" and see wh...
[16:09:30] XioNoX: looking, we dithered back and forth a bit on some of the cache refreshes in annual planning, but I think it's either this quarter, or the next FY... sec
[16:10:06] XioNoX: they're slated for Q4 (this quarter), to replace all the cp1xxx
[16:23:13] eqiad cache refresh is ASAP yes
[16:23:16] please get that going :)
[16:26:16] bblack: and those would be better distributed in several 10G racks or all in the same rack as current? (I'd guess the former)
[16:28:17] XioNoX: better distributed
[16:28:24] perfect
[16:28:48] I'm unsure as to the final count we'll end up ordering, too, as plans at the software level keep slipping
[16:29:34] (either 16 or 20, so either 4 or 5 per row)
[16:30:24] well really, we can make it 16 and keep 4 older ones alive for a while too, out of the many older ones avail.
[16:56:26] why?
[17:37:55] Traffic, Operations, Patch-For-Review: Planning for phasing out non-Forward-Secret TLS ciphers - https://phabricator.wikimedia.org/T118181#4112738 (MoritzMuehlenhoff)
[17:37:59] Traffic, Operations: Remove 3DES patch from OpenSSL builds - https://phabricator.wikimedia.org/T180792#4112736 (MoritzMuehlenhoff) Open>Resolved This was resolved in the latest update of our OpenSSL 1.1 packages for jessie-wikimedia
[17:38:36] bblack: i have a service on cache::misc (webserver_misc_static) that i recently made active-active by adding a codfw backend to the director. now i want to temporarily serve only from codfw to reinstall eqiad with stretch. was it right to comment out eqiad and just leave codfw like this: https://gerrit.wikimedia.org/r/#/c/423580/3/hieradata/role/common/cache/misc.yaml or was it that i should keep
[17:38:42] "eqiad" but set that to the codfw backend.. i seem to remember asking this before and being wrong
[17:39:38] i also remember i should never just flip it from one to only the other in a single step.. so that's how i activated codfw by just adding it
[17:58:00] or... i don't change the cache::misc config at all and just take eqiad down and expect it to route traffic to the codfw backend and that's it
[18:35:29] mutante: yes, commenting out the eqiad side as you did on https://gerrit.wikimedia.org/r/#/c/423580/ is correct if you want to work on bromine
[18:38:32] mutante: the thing you're mentioning about not flipping an active/passive service from one dc to the other in a single step is also true (although it does not apply in this case given that this service is active/active) and described here https://wikitech.wikimedia.org/wiki/Global_traffic_routing#Cache-to-application_routing
[18:41:01] ema: thank you very much, i'll go ahead :)
[18:41:16] that link is also helpful
[18:42:08] it is! :)
[18:45:49] P.S. I renamed the director to "webserver_misc_static" when i made this active-active. Lots of them are named after the hostname of backends. like it was "bromine" with a single backend of bromine but it doesn't really make sense to give them that name
[18:46:17] especially not once there is a 2nd backend
[18:46:23] right
[18:46:44] the things that are served by this are 15.wikipedia.org , annual report , transparency report etc.
[19:21:28] Traffic, Operations, ops-codfw: cp2008 memory replacement - https://phabricator.wikimedia.org/T191224#4113004 (Papaul) Hello Papaul, Thank you for sharing the log. I am currently in Training, however I got a chance to look at the TSR and analyzed it. We do see that the firmware on the ser...
[19:33:13] Traffic, Analytics, Analytics-Data-Quality, Analytics-Kanban, and 4 others: Opera mini IP addresses reassigned - https://phabricator.wikimedia.org/T187014#4113030 (Nuria) >number of Opera Mini users in US is far far below India, Indonesia and Nigeria. Note these are "pageviews", not users. @ema...
[20:01:42] Traffic, Operations, ops-codfw: cp2008 memory replacement - https://phabricator.wikimedia.org/T191224#4113142 (RobH) @Papaul: Please advise to Dell that we saw the error in the logs we provided, and we aren't willing to use the faulty hardware in production without replacement of the memory modules a...
[21:01:41] netops, Operations: Juniper HA audit - https://phabricator.wikimedia.org/T191667#4113291 (ayounsi) p:Triage>Normal