[10:40:49] <_joe_> bblack: can you please take a look at my current etcd "multi-dc" plans? I know you would prefer a true multi-dc cluster but I have been repeatedly advised to wait for an upgrade to etcdv3 before really trying that, and even then, I am not sure we want to bet on multi-dc raft in the short term [10:41:15] <_joe_> the current idea to have at least replication and active-active reads is here https://phabricator.wikimedia.org/T159687 [11:15:07] 10netops, 06Operations: netops: switch all subnets to use install1002/2002 as DHCP - https://phabricator.wikimedia.org/T156109#3075079 (10akosiaris) >>! In T156109#2995356, @Dzahn wrote: > I realize this might be on your last day before you are away for a while, please feel free to put up for grabs and i'll as... [13:07:29] hi, there's an icinga warning for disk space on cp1052 [13:23:30] moritzm: looking [13:24:28] _joe_: honestly, I don't know enough about etcd/raft's existing capabilities and mirrormaker, etc to have much of an informed opinion. What's in the ticket sounds complicated, but, I donno maybe that's our best bet? [13:25:21] _joe_: I think most of these tools (e.g. etcd), they're just not thinking about geographic issues in the first place and it shows in the end. Doing all kinds of clustering and scaling and data-sharing and various levels of guarantees is much easier when you don't have to consider the geographic problem, of course :) [13:27:36] _joe_: What I wish for in my ideal world, is a simple state-management DB (like etcd) which does more-complex consensus stuff: either geo-aware leadership elections and reader/writer-routing (with a plan for what happens on split brain recovery...), or alternatively one that has something more like distributed clock eventually-consistent semantics ala cassandra (DC layout aware, need N/M nodes committing locally and X/Y remotely for transaction success, etc). 
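The "N/M nodes committing locally and X/Y remotely" rule from the 13:27:36 message can be read as a DC-aware commit check. The following is only a minimal sketch of that reading, not how etcd or cassandra implement anything: the `CommitPolicy` type, the `commit_succeeds` function, and the `kv1001`-style node names are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class CommitPolicy:
    """Hypothetical policy: how many acks a write needs on each side."""
    local_required: int   # the "N" in "N/M nodes committing locally"
    remote_required: int  # the "X" in "X/Y remotely"


def commit_succeeds(
    acks: Dict[str, Tuple[str, bool]],
    local_dc: str,
    policy: CommitPolicy,
) -> bool:
    """Return True if enough local and enough remote nodes acknowledged.

    `acks` maps a node name to (datacenter, acknowledged).
    """
    local_acks = sum(1 for dc, ok in acks.values() if ok and dc == local_dc)
    remote_acks = sum(1 for dc, ok in acks.values() if ok and dc != local_dc)
    return (
        local_acks >= policy.local_required
        and remote_acks >= policy.remote_required
    )


# Hypothetical node names; eqiad and codfw are used as example DCs.
acks = {
    "kv1001": ("eqiad", True),
    "kv1002": ("eqiad", True),
    "kv1003": ("eqiad", False),
    "kv2001": ("codfw", True),
    "kv2002": ("codfw", False),
    "kv2003": ("codfw", False),
}

print(commit_succeeds(acks, "eqiad", CommitPolicy(2, 1)))
print(commit_succeeds(acks, "eqiad", CommitPolicy(2, 2)))
```

With a policy of 2 local and 1 remote, a write originating in eqiad is accepted in this example; raising the remote requirement to 2 rejects it, which is the trade-off between write availability and cross-DC durability that the message hints at.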
[13:31:21] I don't know that there's any simple answer that works well today, given etcd as a starting point. [14:46:13] <_joe_> I don't think there is one, no [14:46:52] <_joe_> we could switch to cassandra, of course [14:47:30] * elukey records the last line and sends a picture to Giuseppe for posterity [14:47:45] <_joe_> but then you're left with the problem: is eventual consistency always good enough? [14:48:19] <_joe_> (and then ofc in cassandra you cannot watch keys, etc) [14:52:32] yeah I was just using its consistency model as an example [14:52:49] cassandra's a database, not a dist kv store for config/meta/balance :) [14:53:28] the etcd model as it is works great within a DC I think [14:54:25] we could also consider having two separate stores [14:54:57] each DC gets a 3xEtcd cluster for dc-local data (node pooling and failover and other related metadata within the DC), which is completely separate from other DCs. [14:55:39] and then there's a global etcd cluster with, say, 1 node per DC in all 5 DCs, which only handles global-level metadata (lower-traffic!) and can use raft to get around isolated DCs and such? [14:56:02] or 2 per DC, whatever [14:56:18] (or 1 per in edges, 2 in each core) [14:56:40] just tossing out a random idea though. I have no idea how that would play out in actual scenarios. [14:57:02] if an edge DC becomes network-isolated, would it still validly read the last-known-good data from its local node? [14:57:18] (would there be any way to update that data during isolation if we connect oob?) 
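For the global-cluster idea above, the write side of the isolation question follows from raft's majority rule: a partition can commit only if it holds more than half of the voting members. The sketch below just does that arithmetic for two of the layouts mentioned. The DC labels and the split into 2 core and 3 edge sites are assumptions for illustration, and the helper names are hypothetical. It says nothing about whether an isolated node can still serve last-known-good reads, which depends on the read mode the client asks for and is not modeled here.

```python
from typing import Dict


def quorum(total_nodes: int) -> int:
    """Smallest strict majority of a raft cluster with `total_nodes` voters."""
    return total_nodes // 2 + 1


def can_commit(reachable_nodes: int, total_nodes: int) -> bool:
    """A partition can commit writes only if it holds a majority."""
    return reachable_nodes >= quorum(total_nodes)


def write_availability(layout: Dict[str, int], isolated_dc: str) -> Dict[str, bool]:
    """Who can still commit writes when one DC is cut off from the rest?"""
    total = sum(layout.values())
    isolated = layout[isolated_dc]
    return {
        "rest_of_cluster": can_commit(total - isolated, total),
        "isolated_dc": can_commit(isolated, total),
    }


# Hypothetical labels: 2 core sites and 3 edge sites (an assumed split of "5 DCs").
one_per_dc = {"core1": 1, "core2": 1, "edge1": 1, "edge2": 1, "edge3": 1}
two_in_cores = {"core1": 2, "core2": 2, "edge1": 1, "edge2": 1, "edge3": 1}

print(quorum(sum(one_per_dc.values())))
print(write_availability(one_per_dc, "edge1"))
print(write_availability(two_in_cores, "core1"))
```

In both layouts the isolated site cannot commit writes on its own while the rest of the cluster keeps a majority, so updating data inside an isolated DC would need some mechanism outside raft itself.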
[15:41:18] 10netops, 06Operations, 10fundraising-tech-ops: set up firewall policies for barium, lutetium, db1025, and indium replacement servers - https://phabricator.wikimedia.org/T159336#3076013 (10Jgreen) [15:43:16] 10netops, 06Operations, 10fundraising-tech-ops: set up firewall policies for barium, lutetium, db1025, and indium replacement servers - https://phabricator.wikimedia.org/T159336#3064590 (10Jgreen) Also (fundraising private repo): commit 8e403abe1e552b078d217479c9f48ed23d892380 Author: Jeff Green bblack: just reading, yes, but we also designed the apps to react in that way if etcd is unreachable [16:01:22] <_joe_> also, what data is local to what? [16:01:26] <_joe_> that's hard to map [16:01:43] <_joe_> because data about a varnish server in eqiad is needed by a varnish server in esams [16:02:02] <_joe_> so you would need to have a crazy replica set or incredibly complex clients [16:02:18] <_joe_> my plan is to do this crossed replica for now [16:02:35] ok [16:02:49] <_joe_> and then move to a global cluster IF etcd3 is actually better [16:03:00] <_joe_> with high-latency raft [16:03:15] well I trust your judgement. like I said earlier, I've done about zero real digging into all of this :) [16:03:36] <_joe_> for now my failure model is "it will take a few minutes to have etcd available for writing if one DC fails" [16:04:09] <_joe_> which is already way better than what we have now [16:05:15] ok :) [16:07:30] <_joe_> oh and btw, once you feel confident with dns discovery, https://gerrit.wikimedia.org/r/#/q/topic:app_routes_deprecation+(status:open) [16:08:07] <_joe_> this series of patches should remove any reference to application routes from puppet [16:18:37] <_joe_> bblack: am I correct that discovery will go live once you find the way to fix the dns lint job in CI? 
[16:21:35] _joe_: yes, basically [16:22:11] lint fix, and then a followup zonefie patch that deploys the new discovery hostname records [16:22:16] *zonefile [16:22:46] I stabbed at lint fix a little last week and it got complicated, need to retry sometime early this week [16:22:47] <_joe_> ok [16:23:10] <_joe_> yes please, we kinda need to keep this moving [16:23:18] <_joe_> :) [16:24:11] basically modules/authdns/init.pp needs a split into "basic underlying support that a server and a linter need", with the server bits moving out to authdns::server and such [16:24:54] it gets complicated because the linter uses a chroot to a directory that seems to be defined at runtime when lint invokes, and some of the underlying shared basics of lint+server currently use fixed paths in /etc/ [16:25:08] so, yeah [16:25:15] resolvable, just complicated [17:38:59] 10Domains, 10Traffic, 06Operations: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3076309 (10Beetlebeard) >>! In T158638#3043708, @Reedy wrote: > https://github.com/wikimedia/operations-dns/blob/master/templates/wikimedia.ee > > If you follow "Add a record to... [18:22:17] 10Domains, 10Traffic, 06Operations, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3076473 (10Dzahn) @Beetlebeard how does that Gerrit link look to you? [18:32:14] 10Wikimedia-Apache-configuration, 06Operations: Create 2030.wikimedia.org redirect to Meta portal - https://phabricator.wikimedia.org/T158981#3076590 (10Dzahn) a:03Dzahn [18:32:48] 10Wikimedia-Apache-configuration, 06Operations: Create 2030.wikimedia.org redirect to Meta portal - https://phabricator.wikimedia.org/T158981#3053534 (10Dzahn) @gpaumier The name can be used. I will take this to get it up before Wednesday. 
[18:43:20] 10netops, 06Operations: Slight packet loss observed on the network starting Nov 2016 - https://phabricator.wikimedia.org/T154507#3076644 (10Ottomata) p:05Triage>03Normal [19:02:25] 10Traffic, 10MediaWiki-API, 06Operations: Varnish does not cache Action API responses when logged in - https://phabricator.wikimedia.org/T155314#3076735 (10Ottomata) p:05Triage>03Normal Ping @ema. I'm not sure how to triage this. Does something need to change on the varnish end? Or just the change @An... [19:06:39] 07HTTPS, 10Traffic, 06Discovery, 06Operations, and 2 others: compile number of http uses for http://www.wikidata.org/entity - https://phabricator.wikimedia.org/T154017#3076741 (10Ottomata) p:05Triage>03Low [19:07:37] 07HTTPS, 10Traffic, 06Discovery, 06Operations, and 3 others: announce breaking change: http > https for entities in rdf - https://phabricator.wikimedia.org/T154015#3076747 (10Ottomata) This has been placed on the Traffic board, removing operations tag. [19:17:29] 10Domains, 10Traffic, 06Operations, 15User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#3076776 (10Ottomata) p:05Triage>03Normal I'm not sure who will make this decision, but @robh often handles cert issues, so let's ask him. [19:18:46] 10Domains, 10Traffic, 06Operations, 15User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#2855117 (10BBlack) Probably this should be merged into T133548, unless it's altered to be about implementing some other solution outside of... [19:18:52] 10Wikimedia-Apache-configuration: URL to pagenames with special characters fail - https://phabricator.wikimedia.org/T153275#3076797 (10Ottomata) Pretty sure there is no operations action item here, so removing tag. 
[19:22:55] 10Domains, 10Traffic, 06Operations, 15User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#3076819 (10RobH) Implementation of SSL on our cluster is handled by the traffic team. So if there is a problem, they would be ideal to ask.... [19:28:30] 10Domains, 10Traffic, 06Operations, 15User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#3076823 (10Dzahn) @Robh yea, but wikipedia.cz is our IP and DNS servers [19:28:42] 10Domains, 10Traffic, 06Operations, 15User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#3076837 (10Ottomata) 05Open>03declined @Urbanecm, I'm going to decline this ticket then. Wikimedia CZ owns this domain, so they'd have... [19:28:45] 07HTTPS, 10Traffic, 06Operations, 13Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#3076839 (10Ottomata) [19:29:17] 10Traffic, 06Operations: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3076841 (10Ottomata) p:05Triage>03Normal I don't fully understand this issue, but I'm not sure how it could be fixed on our side. If there is a way, its likely to be very... [19:33:21] 07HTTPS, 10Traffic, 06Operations, 13Patch-For-Review: Create a secure redirect service for large count of non-canonical / junk domains - https://phabricator.wikimedia.org/T133548#3076848 (10Ottomata) [19:33:23] 10Domains, 10Traffic, 06Operations, 15User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#3076846 (10Ottomata) 05declined>03Open Re-opening, I think both @robh and I read this as 'wikimedia.cz', not 'wikipedia.cz'. wikipedia.... 
[19:39:37] 07HTTPS, 10Traffic, 06Operations, 05Security: $wgServer with initial https:// does not force HTTPS - https://phabricator.wikimedia.org/T156320#3076936 (10Ottomata) [19:41:11] 10Domains, 10Traffic, 06Operations, 15User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#3076946 (10Urbanecm) >>! In T152622#3076846, @Ottomata wrote: > Re-opening, I think both @robh and I read this as 'wikimedia.cz', not 'wikip... [19:42:47] 10Domains, 10Traffic, 06Operations, 15User-Urbanecm: Wikipedia.cz and other domains owned by WMCZ have invalid certificate - https://phabricator.wikimedia.org/T152622#3076984 (10Urbanecm) For clarifying: * wikimedia.cz was linked just for case when somebody want's to look what I mean by WMCZ as it links t... [19:43:11] 10Traffic, 06Operations: Hardware purchasing for Asia Cache DC - https://phabricator.wikimedia.org/T156033#3076986 (10Ottomata) p:05Triage>03Normal [19:43:21] 10Traffic, 06Operations: Hardware installation for Asia Cache DC - https://phabricator.wikimedia.org/T156032#3076989 (10Ottomata) p:05Triage>03Normal [19:43:39] 10Traffic, 06Operations: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#3076990 (10Ottomata) p:05Triage>03Normal [19:43:48] 10Traffic, 06Operations: Select site vendor for Asia Cache Datacenter - https://phabricator.wikimedia.org/T156030#3076992 (10Ottomata) p:05Triage>03Normal [19:43:59] 10Traffic, 06Operations: Name Asia Cache DC site - https://phabricator.wikimedia.org/T156028#3076995 (10Ottomata) p:05Triage>03Normal [19:44:08] 10Traffic, 06Operations: Configuration for Asia Cache DC hosts - https://phabricator.wikimedia.org/T156027#3076997 (10Ottomata) p:05Triage>03Normal [19:45:30] 07HTTPS, 10Traffic, 06Operations, 05Security: $wgServer with initial https:// does not force HTTPS - https://phabricator.wikimedia.org/T156320#3077006 (10Ottomata) p:05Triage>03Normal [19:50:59] 07HTTPS, 10Traffic, 
06Discovery, 06Operations, and 3 others: announce breaking change: http > https for entities in rdf - https://phabricator.wikimedia.org/T154015#3077052 (10Ottomata) p:05Triage>03Normal [19:55:03] 10Domains, 10Traffic, 06Operations, 13Patch-For-Review: Using wikimedia.ee mail address as Google account - https://phabricator.wikimedia.org/T158638#3077090 (10Dzahn) a:03Dzahn [20:47:51] 07HTTPS, 10Traffic, 06Discovery, 06Operations, and 3 others: announce breaking change: http > https for entities in rdf - https://phabricator.wikimedia.org/T154015#3077327 (10Smalyshev) p:05Normal>03Low Note that this is not a traffic/ops question, it's wikidata modeling question, and so far I am not c... [21:04:42] 07HTTPS, 10Traffic, 06Discovery, 06Operations, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#3077452 (10Lydia_Pintscher) [21:04:46] 07HTTPS, 10Traffic, 06Discovery, 06Operations, and 3 others: announce breaking change: http > https for entities in rdf - https://phabricator.wikimedia.org/T154015#3077451 (10Lydia_Pintscher) 05Open>03stalled [21:28:46] 10netops, 06Operations: netmon1002 networking setup - https://phabricator.wikimedia.org/T159757#3077581 (10RobH) [21:54:05] 10Traffic, 06Operations: Fix broken referer categorization for visits from Safari browsers - https://phabricator.wikimedia.org/T154702#3077681 (10JKatzWMF) @Ottomata Thanks for voicing your uncertainty. I am also uncertain/confused about the cause (T87276), or the solution as it is a bit out of my technical d... [22:22:22] 10Traffic, 10MediaWiki-API, 06Operations: Varnish does not cache Action API responses when logged in - https://phabricator.wikimedia.org/T155314#3077838 (10Tgr) >>! In T155314#2945672, @Anomie wrote: > A rough idea might be to add code (probably in SessionBackend and `User::loadFromSession()`?) to set some f... 
[22:55:27] 10netops, 06Operations, 10ops-codfw: codfw: ms-be2028-ms-be2039/switch port configuration - https://phabricator.wikimedia.org/T159765#3077959 (10Papaul) [23:19:47] 10netops, 06Operations, 10ops-codfw: codfw: ms-be2028-ms-be2039/switch port configuration - https://phabricator.wikimedia.org/T159765#3078094 (10RobH) [23:23:14] 10netops, 06Operations, 10ops-codfw: codfw: ms-be2028-ms-be2039/switch port configuration - https://phabricator.wikimedia.org/T159765#3078101 (10RobH) Papaul: This seems to be a duplicate of T158714, but they have different info for some of the ports. Also this links to a task about wtp2019 currently, whic... [23:23:46] 10netops, 06Operations, 10ops-codfw: codfw: ms-be2028-ms-be2039/switch port configuration - https://phabricator.wikimedia.org/T159765#3078104 (10RobH) a:05RobH>03Papaul Basically we'll need to know if the details in this task are correct, or if previously done T158714 is correct. [23:30:49] 10netops, 06Operations, 10ops-codfw: codfw:ms-be2028-ms-be2039 switch port configuration - https://phabricator.wikimedia.org/T158714#3078145 (10RobH) [23:47:15] 10netops, 06Operations, 10ops-codfw: codfw: ms-be2028-ms-be2039/switch port configuration - https://phabricator.wikimedia.org/T159765#3078190 (10Papaul) 05Open>03declined