[00:36:52] 10netops, 10Cognate, 10Growth-Team, 10Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi)
[00:40:11] 10netops, 10Cognate, 10Growth-Team, 10Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi)
[00:48:06] 10netops, 10Cognate, 10Growth-Team, 10Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi)
[00:50:19] 10Traffic, 10netops, 10Operations: IPv6 ~20ms higher ping than IPv4 to gerrit - https://phabricator.wikimedia.org/T211079 (10ayounsi)
[00:50:21] 10netops, 10Operations, 10Performance-Team (Radar): Stop prioritizing peering over transit - https://phabricator.wikimedia.org/T204281 (10ayounsi) 05Open→03Resolved Everything here is done. Will reopen if any signs of issues down the road.
[06:45:09] 10netops, 10Cognate, 10Growth-Team, 10Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10akosiaris)
[06:45:30] 10netops, 10Cognate, 10Growth-Team, 10Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10akosiaris)
[06:53:58] 10netops, 10Cognate, 10Growth-Team, 10Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10akosiaris) I was looking at Special needs or unsorted. @ayounsi I've updated a few, feel free to move them to other sections. Pinging: * ge-2/0/...
[09:00:19] 10netops, 10Cognate, 10Growth-Team, 10Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10fgiunchedi)
[10:45:37] 10HTTPS, 10Traffic, 10Operations, 10Wikimedia-Planet: enable HSTS on *.planet.wikimedia.org - https://phabricator.wikimedia.org/T132543 (10Dzahn)
[10:51:57] 10netops, 10Cognate, 10Growth-Team, 10Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10hashar) contint1001 hosts the CI system, subscribing @thcipriani as well. It is not clear to me what this operation is about. Is that just about re...
[10:52:34] 10HTTPS, 10Traffic, 10Operations, 10Wikimedia-Planet: https://planet.wikimedia.org redirects to http://meta.wikimedia.org/wiki/Planet_Wikimedia - https://phabricator.wikimedia.org/T70554 (10Dzahn)
[11:40:43] 10netops, 10Cognate, 10Growth-Team, 10Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10elukey) About Analytics nodes: * ge-1/0/7 - kafka-jumbo1001 -> Kafka needs to be stopped ~10/15 minutes beforehand to have a graceful shutdown (...
[11:47:49] 10netops, 10Cognate, 10Growth-Team, 10Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10elukey) ` ge-6/0/25 - mc1019 ge-6/0/26 - mc1020 ge-6/0/27 - mc1021 ge-6/0/28 - mc1022 ge-6/0/29 - mc1023 ` The above ones are holding the eqiad m...
[13:09:52] 10Traffic, 10Operations: Indexing of https://www.wikidata.org in the Yandex Search Engine - https://phabricator.wikimedia.org/T217407 (10jbond) p:05Triage→03Normal
[16:23:37] 10netops, 10Cognate, 10Growth-Team, 10Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi)
[16:41:55] 10netops, 10Cognate, 10Growth-Team, 10Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi)
[17:00:34] 10netops, 10Cognate, 10Growth-Team, 10Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi)
[17:07:31] 10netops, 10Cognate, 10Growth-Team, 10Language-Team, and 6 others: Rack/cable/configure asw2-a-eqiad switch stack - https://phabricator.wikimedia.org/T187960 (10ayounsi) Thank you all for the quick replies! >>! In T187960#5000934, @hashar wrote: > Is that just about re cabling the server from a switch to...
[17:51:19] er, https://cablemap.info/ has moved to the dark side, it's now a commercial product...
[17:52:02] doh
[17:53:46] Also asked an engineer at the largest ISP in Colombia about latency to codfw/eqiad: "I think there will not be a significant difference between both locations, Dallas and Ashburn."
[17:55:37] and yeah, the cable map in the Caribbean is quite a mess but everything lands in Florida
[17:57:46] And great circle distance between Miami and Dallas is ~1800km, and to Ashburn ~1500km
[18:00:21] "The map is available at: https://dev.networkatlas.org/"
[18:00:31] click the link and it goes to https://networkatlas.com/
[18:01:03] lol, both of the urls drop the www and s/org/com/
[18:01:25] https://live.networkatlas.com/ is kinda easier to view though
[18:02:25] ah, didn't see that one
[18:02:37] there is another guy also documenting land cables
[18:02:51] I need to find it again, it's somewhere in my bookmarks
[18:03:06] ah the new one has land too!
[18:03:25] it took a few clicks to find it :)
[18:03:35] the old map has different link size for capacity
[18:03:42] here they're all the same
[18:04:19] wow, land fibers in SF are soooooo well documented
[18:05:11] I thought it was amusing/scary you could view fibre along various streets in NYC (on other maps)
[20:05:18] XioNoX: not a priority, but I added you to T216137 so you know these 2 are the only servers in this datacenter for which we will need 10G
[20:06:01] not sure how hard that may be in terms of DC positioning and networking configuration
[20:06:28] we == database-related servers
[20:18:45] jynus: thanks, what would be the expected traffic flows?
[20:19:09] (I can ask on the task if it's too long for right now :) )
[20:20:05] "what would be the expected traffic flows" what do you mean?
[20:20:20] peak bandwidth?
[20:24:10] jynus: some high level of "we're expecting 10 other servers in the same DC to talk to this server" or "it's going to be daily spikes of 10Gbps transfer between eqiad and codfw" etc.
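(The ~1800 km and ~1500 km figures quoted above can be reproduced with a haversine great-circle calculation; a minimal sketch, where the city coordinates are approximate values I've assumed, not taken from the log:)

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Approximate coordinates (assumed for illustration)
miami = (25.76, -80.19)
dallas = (32.78, -96.80)
ashburn = (39.04, -77.49)

print(round(great_circle_km(*miami, *dallas)))   # ~1790 km, i.e. "~1800km"
print(round(great_circle_km(*miami, *ashburn)))  # ~1500 km
```

The extra ~300 km to Dallas is roughly 1.5 ms of one-way propagation delay in fibre, which matches the Colombian engineer's "no significant difference" assessment.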
[20:24:14] the intention is to get 400 GB/s on each of 4 hosts, incoming from memory to its SSD from 4 simultaneous databases, and as much as the 10G outgoing connection to bacula
[20:24:36] what the SSDs will be able to do is another story :-)
[20:24:53] but at the moment we normally saturate the 1G link with a single connection
[20:25:39] so it would be 10G backup traffic toward bacula? Would that be continuous or spiky? To the same DC or cross-DC?
[20:25:41] there will be around 13 hosts (probably fewer) sending the backup to each host
[20:26:44] bacula won't be able to use the full link because it will be all HDs; it is the memory to SSD (incoming) that will likely saturate it
[20:27:30] that is peak, normally we only expect 2 1Gb simultaneous connections for 12 hours a day
[20:28:04] we will tune the parallelism, for now it is 2 for logical backups and 2 for physical
[20:28:28] I can actually give you a simulation
[20:29:23] XioNoX: this: https://grafana.wikimedia.org/d/000000274/prometheus-machine-stats?panelId=8&fullscreen&orgId=1&var-server=dbstore1001&var-datasource=eqiad%20prometheus%2Fops&from=1551212950705&to=1551817690706
[20:29:44] but happening on a faster host twice as fast and every day instead of every week
[20:29:45] ok, note that this doesn't block anything, but I want to understand in which quantity and which directions the packets are flowing to make sure the infra is scaled properly
[20:29:56] ^that is the best idea
[20:30:21] except on an SSD and more powerful host and happening almost continuously :-D
[20:30:56] so that says "received" so it's what other hosts are sending to this machine
[20:30:57] please don't hate us 0:-)
[20:31:02] yep
[20:31:17] jynus: is that bacula or something else?
[20:31:19] we will also send to bacula, but that is low
[20:31:29] jynus: I'm always happy when people actually use the network :)
[20:31:29] so if you want the details, I have a graph
[20:31:34] jynus: just claim that your work involves that
[20:31:36] this is the backup postprocessing
[20:31:45] instead of downloading torrents in bacula ;)
[20:32:06] actually torrent is (not kidding) on the horizon
[20:32:14] but not for this
[20:32:45] XioNoX: so "faster backups" is one side of the reason
[20:32:51] xml dumps maybe
[20:33:15] the other part is, should the worst thing happen and we have to recover 10 hosts at the same time, this should be able to handle it
[20:33:38] so in an "everything is broken" scenario
[20:34:02] XioNoX: I propose to, not this quarter, have a meeting and send you documentation
[20:34:13] so you participate in the discussion
[20:34:30] the design committee probably predates your hiring :-)
[20:34:48] but it was done with m*rk and everybody's input
[20:35:08] so I believe it made sense
[20:35:09] jynus: yeah sure
[20:35:14] jynus: that looks good to me
[20:35:20] yeah it makes sense
[20:35:34] jynus: are there any cross-DC flows or will everything stay local?
[20:35:37] no
[20:35:51] we just duplicate everything among DCs
[20:36:16] as I said multiple times, cross-DC only happens in a "we lose a full DC" scenario
[20:36:32] wikipedia is down or something
[20:37:09] if that is your biggest fear, not happening, unless everything is down, in which case you probably wouldn't care :-P
[20:38:29] jynus: I'm asking because we have 10G links between DCs, so if we have regular 10G transfers between sites we need to upgrade those links
[20:38:43] no, not happening regularly
[20:38:57] so yeah, that's cool
[20:39:26] the summary of this is: many databases -> provisioning server, provisioning server -> 1 database at a time, and provisioning server -> long term backups (bacula)
[20:39:35] everything within a DC
[20:40:11] makes sense
[20:40:33] cross-DC might be bacula->bacula I guess, but that's another story
[20:41:25] I don't know about that
[20:41:44] but if it was, long term is much smaller for our service
[20:42:29] ok
[20:42:30] cool!
[20:42:34] thanks!
[20:49:13] yeah, if we ever did get to where we had some bulky saturating cross-DC traffic that wasn't realtime-critical, we could always either play some kind of qos/shaping games to keep it from impacting more-important flows, or alternately set up a separate virtual circuit or tunnel for it.
[20:49:21] (e.g. backups)
[20:51:20] in our strategy, we would be fine with having only local backups, but I don't know what the overall bacula strategy is
[20:52:24] sending them only locally rather than letting it replicate would be way more efficient and would create extra redundancy against application-level corruption
[20:59:12] bblack: for the short term, a rate-limiter works, but it's a pain to manage long term. It's usually cheaper to get larger pipes (in terms of engineering time, configuration complexity, eliminating bottlenecks, etc.)
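(The "qos/shaping games" and "rate-limiter" mentioned above usually come down to some form of token-bucket policing; a minimal sketch of the mechanism, not any production config. The rate and capacity numbers are arbitrary illustration values:)

```python
import time

class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, refills at `rate` tokens/sec."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start full: an initial burst is allowed
        self.last = time.monotonic()

    def allow(self, cost=1.0):
        """Return True if an item of `cost` tokens may pass right now."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)      # 5 items/s sustained, burst of 10
sent = sum(bucket.allow() for _ in range(100)) # 100 back-to-back attempts
print(sent)                                    # only the initial burst passes
```

This is the same shape as e.g. a tc token bucket filter; it illustrates why such limiters are operationally fiddly: the right rate/burst values drift as traffic patterns change, whereas a bigger pipe needs no per-flow tuning.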
[23:02:24] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1058.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:02:36] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1059.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:02:50] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1060.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:03:04] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1061.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:03:15] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1062.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:03:32] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1063.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:03:45] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1064.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:03:57] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1065.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:04:10] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1066.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:04:23] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1067.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:04:35] 10Traffic, 10Operations, 10decommission, 10ops-eqiad: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for cp1068.eqiad.wmnet and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB...
[23:19:48] 10Traffic, 10Operations, 10decommission, 10ops-eqiad, 10Patch-For-Review: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10RobH)
[23:20:30] 10Traffic, 10Operations, 10decommission, 10ops-eqiad, 10Patch-For-Review: Decommission old eqiad caches - https://phabricator.wikimedia.org/T208584 (10RobH) a:03Cmjohnson