[02:15:27] 10Domains, 10Traffic, 10Operations, 10WMF-Design, and 2 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282 (10Prtksxna) [02:15:32] 10Domains, 10Traffic, 10Operations, 10WikimediaUI Style Guide, 10Patch-For-Review: Redirect design.wikimedia.org/style-guide/wiki/* to design.wikimedia.org/style-guide/ - https://phabricator.wikimedia.org/T200304 (10Prtksxna) 05Open>03Resolved Thanks a ton @Dzahn! Works now {icon smile-o} [03:18:30] 10netops, 10Operations: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10ayounsi) Some doc provided by JTAC: https://www.juniper.net/documentation/en_US/release-independent/vcf/information-products/pathway-pages/vcf-best-practices-guide.pdf https://www.juniper... [07:28:53] ema: buongiorno [07:28:56] one q [07:29:17] is it possible to unset a header only for external reqs? [07:30:58] mobrovac: dobro jutro [07:33:05] mobrovac: https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb#L172 [07:34:51] yes! that's what I need! [07:35:27] hvala ema [07:35:54] :) [07:36:10] ah, now i see below x-client-ip only for localhost [07:36:16] but i believe that to be wrong [07:37:06] let's say an external req reaches an entity inside our env, and that service sends another request to varnish and fwds the header [07:37:15] with this in action we will lose the info [07:41:42] that's a thorny subject [07:42:37] I don't think we have specific guidelines/best practices for how internal services should access the CDN layer [07:42:58] agreed [07:44:14] the reasoning behind accepting X-Client-IP from the local nginx only is of course that we don't like people spoofing it :) [07:44:54] if you look a bit below that though there's the X-Trusted-Proxy exception [07:45:03] https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb#L228 [07:45:44] there we parse X-Forwarded-For and override X-Client-IP, for selected proxies only [07:50:56] ah ok [07:50:57] good point [07:52:26] ha! gerrit doesn't let me add you as a reviewer because your nick is to short to appear in the list [07:52:30] nice trick ema [07:52:31] haha [07:52:36] here's the patch - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/451240/ [07:52:56] wat [07:53:30] mobrovac: you're right! ema@ works though [07:53:39] ah ok [07:53:45] now that i know the trick ... [07:53:46] :P [07:53:58] please don't share it widely [07:54:39] haha [07:55:15] mobrovac: could you maybe add a comment in the code explaining the reasoning too? [07:55:49] sure [07:58:19] ema: done [08:48:14] Oh thats how [08:48:50] * mark gives valentin a relief ;p [08:54:50] uh? :P [10:36:57] now marko figured out how one can add ema as reviewer ;p [10:37:09] hahahah [11:02:52] but i guess you're both lucky that i'm too busy with management stuff these weeks ;( [11:10:32] not lucky, just more focused in our goals :) [11:50:43] mobrovac: I think eventually we'll want to set a req id from varnish as well (for the initial external requests). 
I haven't spent any time nailing down the format so it can match the opentrace span stuff, etc [11:51:00] mobrovac: if you have details handy though, it's probably not hard to implement [11:51:02] yup bblack, that's the idea [11:51:28] bblack: T201409 [11:51:29] T201409: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 [11:51:53] "Use a UUID v1/v4 x-request-id header/entity. Varnish f-e (soon ATS) is the main point of entry of external requests. Therefore, it can generate the request IDs and attach them to requests in the form of the x-request-id header, which can then be used and propagated by all entities behind it." [11:53:09] what concerns my naive pov is ian's: The overlap that I see is that the request ID generated by Varnish could be compatible with the span IDs that Opentracing implementations use, and in that case, we'd be able to use the request ID as the "parent span". [11:53:30] does a basic uuid cover that, or is compat with opentrace spans something else beyond just a uuid? [11:53:33] 10Traffic, 10Core-Platform-Team, 10Operations, 10Performance-Team, and 4 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10mobrovac) [11:53:59] (in terms of format or generation) [11:54:26] bblack: that's the part that i have yet to look into [11:54:49] but i agree with ian that we should have something that is compatible/interoperable with open tracing efforts [11:55:23] the open tracing specs are on my today afternoon's reading list [11:55:43] other people read love stories on hot summer days ... [11:55:48] :) [11:56:43] from spending like 3 minutes on it, I think opentrace ids are still just opaques, and the span relationships are created with other metadata, in which case we don't have to worry at this level [11:57:17] but back at the zero minute mark, I kind of assumed there must be some meta-format to how the request id content itself was constructed, in order to encode some span-closure idea in them [11:57:27] I'm still not 100% sure :) [11:58:52] time uuids encode the ts at the time of a request's entry, so a second part to it would need to be devised for the end of response part [11:58:56] yay for identification of requests throughout our stack [11:59:40] but since we know when the request ends, perhaps just attaching a second uuid would be enough [12:00:16] right [12:00:21] as in, if every entity in our stack has two uuids, then we can have flame graphs and all of the niceties, but that complicates things considerably [12:00:43] i wrote the task with "i want all log entries for this req id" in mind [12:01:22] we can set x-request-id as a request-side header immediately on ingress (for other parts of the stack to see in the request), and then our webrequest logs are actually generated at the final point of delivery (the last point we touch the response in varnish before the user sees it), so we could put both the initial request-id and a finisher uuid into the webrequest data [12:01:32] technically, if we had that, we could devise flame graphs nonetheless by using the ts of log entries of each entity in the request chain [12:01:50] ah yes, indeed [12:06:44] so if you want the time encoded, v1 is the way to go [12:07:13] we might do a custom implementation and hash the macaddr part or something [12:07:27] (or base it on some other unique property rather than macaddr, perhaps) [12:07:52] the macaddr is already part of uuid v1 iirc [12:09:14] right, I just mean we might not
use the raw macaddr directly in the UUID, it's kind of a strange leak [12:09:49] you can set the multicast bit and use something else in its place (hashed macaddr, or hashed something-else-node-unique) [12:11:03] bblack: I've started the numa_networking reboots, stopped shortly before lunch because cp2022 funnily rebooted into d-i [12:11:12] I'm gonna reimage it as stretch now [12:11:26] yeah sometimes we find nodes that have been up forever are accidentally set to pxe-first [12:11:27] ah right right [12:11:36] we should have some way to audit that with dellomsa or ipmi or something [12:12:32] Krenair: I've seen that you've implemented the metadata API in certcentral returning text/yaml [12:13:27] but according to https://puppet.com/docs/puppet/4.8/http_api/http_file_metadata.html#supported-response-formats [12:13:35] we should be using text/pson [12:14:27] https://puppet.com/docs/puppet/4.8/http_api/pson.html [12:15:19] (thanks puppet for PSON btw ¬¬) [12:15:50] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on neodymium.eqiad.wmnet for hosts: ``` cp2022.codfw.wmnet ``` The log can be found in `/var/log/wmf-auto-reimage/201808... [12:28:06] uuids is a real rabbithole [12:28:27] generating v4 is easy to do in a high-performance and unique way, but no timestamp [12:28:49] v1's standard format just doesn't leave a whole lot of randomness on the table [12:29:30] if you really need it to be unique, uuidv1 is not the way to go in a high traffic environment [12:29:47] 48 bits are the mac (or other similar thing), there's a 100ns timestamp taking 60 bits, and that leaves at best 20 bits for randomness [12:30:40] but then there's this 14-bit "clock id", which libuuid implements by reading/writing some machine-global statefile (ugh) [12:31:22] it's easy to design a uuid generation method by hand that meets our needs better, but then it would be non-standard when N other service programming language users want to just use their standard library's uuid generator function [12:31:25] 10Traffic, 10Operations, 10Patch-For-Review: Upgrade cache servers to stretch - https://phabricator.wikimedia.org/T200445 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['cp2022.codfw.wmnet'] ``` Of which those **FAILED**: ``` ['cp2022.codfw.wmnet'] ``` [12:34:59] iirc the clock bits are used for higher resolution so that clock bits + randomness minimise id collisions [12:35:26] yeah but if you really use all 14 clock bits for clock bits, there's hardly any random bits left [12:35:27] also, when converting to a ts i don't think they're used at all [12:35:45] it's not too much a stretch to just use all the bits that aren't the ts or macaddr as randomness, but still [12:37:35] more-ideal for a high-traffic environment and a reasonable infrastructure node count would be something like: 56-bits clock (a little better than microsecond res from epoch), 32-bits hashed nodeid, 40 bits truly random [12:38:34] but again, making up custom uuid encodings seems like an antipattern if we're expecting N current and future services to all abide by it in different programming languages. they'd all have to custom reimplement it.
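A minimal sketch of how a standard v1 UUID packs the fields being discussed here (per RFC 4122: 60-bit 100ns timestamp, 14-bit clock sequence, 48-bit node), with the multicast bit set so the node slot carries something hashed rather than a raw MAC. `make_uuid_v1` and the constants fed to it are invented for illustration only; this is not the generator anyone settled on.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* 100ns ticks between the Gregorian epoch (1582-10-15) and the Unix epoch. */
#define UUID_GREGORIAN_OFFSET 0x01B21DD213814000ULL

static void make_uuid_v1(char out[37], uint64_t ts_100ns, uint16_t clock_seq, uint64_t node48)
{
    uint64_t t = ts_100ns & 0x0FFFFFFFFFFFFFFFULL;                 /* 60-bit timestamp   */
    uint32_t time_low = (uint32_t)(t & 0xFFFFFFFFULL);
    uint16_t time_mid = (uint16_t)((t >> 32) & 0xFFFF);
    uint16_t time_hi  = (uint16_t)(((t >> 48) & 0x0FFF) | 0x1000); /* version 1 bits     */
    uint16_t cs       = (uint16_t)((clock_seq & 0x3FFF) | 0x8000); /* RFC 4122 variant   */

    node48 |= 0x010000000000ULL;  /* multicast bit set: "this is not a real MAC address" */
    snprintf(out, 37, "%08x-%04x-%04x-%04x-%012llx",
             (unsigned)time_low, (unsigned)time_mid, (unsigned)time_hi, (unsigned)cs,
             (unsigned long long)(node48 & 0xFFFFFFFFFFFFULL));
}

int main(void)
{
    struct timespec now;
    clock_gettime(CLOCK_REALTIME, &now);
    uint64_t ts = UUID_GREGORIAN_OFFSET
                  + (uint64_t)now.tv_sec * 10000000ULL + (uint64_t)now.tv_nsec / 100;

    char buf[37];
    /* clock_seq would be the 14 "spare" random bits, node48 a hashed node id. */
    make_uuid_v1(buf, ts, 0x2abc, 0x0000c0ffee42ULL);
    printf("%s\n", buf);
    return 0;
}
```

The "56-bits clock, 32-bits hashed nodeid, 40 bits truly random" layout floated above would repack the same 128 bits differently, which is exactly why it stops being something a stock library uuid1() call can produce.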
[12:41:14] anyways, the actual timestamp part of uuidv1 seems the most important property to preserve [12:41:25] that and the bit indicating v1 at all [12:41:54] that leaves 48+14 bits of mac+"clock" we could play with, and if other services use them in the traditional way it's not the end of the world. [12:42:15] indeed [12:42:21] (we could set them to 32-bits hashed nodeid + 30 bits of good randomness) [12:42:47] in practice, if we were to alter the uuid generation method, we would just need to ensure that entities that can start request chains abide to it [12:42:54] "just"... heh [12:43:27] I think with opentrace, all the sub-layers/services traversed will generate their own UUIDs as well, which are spanned underneath the parent [12:43:52] ah, like sub-req-ids [12:43:55] hm [12:45:12] well maybe 30 bits of nodeid and 32 bits of random. 32 is nice because it's easy to implement a fast LFSR for that and just seed it from /dev/urandom or whatever. [12:45:31] (thinking about efficiency at the varnish/ATS edge I mean, where anything expensive will get expensive quickly) [12:45:35] yeah, no adjustments needed for that [12:46:29] but even if intra-prod services do generate their uuid slightly differently than varnish i don't tihnk it matters if the main req-id gets preserved and carried over [12:47:30] right [13:33:08] vgutierrez, it worked for me [13:33:28] I think I found some other docs or code that made it look like you could send YAML [13:37:30] vgutierrez, if you just feel like implementing pson, or new versions of puppet won't accept YAML, then go for it [13:43:57] vgutierrez, I know we're on puppet 4.8, looks like from puppet 5.0+ they support JSON responses to the metadata call [13:44:06] based on the newer version of that page [13:51:52] ema: role(cache::canary) [13:51:52] include ::role::authdns::testns [13:51:52] interface::add_ip6_mapped { 'main': } [13:51:55] bleh [13:52:16] ema: https://gerrit.wikimedia.org/r/451324 and the one after it (the third can wait a bit) [13:52:29] seem ok? [13:58:47] bblack: yup [14:00:23] thanks! I'll poke at the eqiad decom patch a little later with puppet disabled, etc [14:00:38] the node nodes seem to be doing fine on their own, even though upload only has 7/8 available due to the memory error [14:00:50] s/node nodes/new nodes/ [14:06:11] nice! [14:06:23] cp2022 decided to reboot into pxe again during reimage [14:06:32] I'm now trying with ipmitool -I lanplus -H "$hostname" -U root -E chassis bootdev disk [14:11:12] alright, trying again with a full reimage [14:11:21] (that worked, the system booted from disk) [14:11:43] ema: https://wikitech.wikimedia.org/wiki/Management_Interfaces#Is_there_any_overriding_for_next_boot? [14:11:54] bootdev none to reset any override [14:12:02] aaaah! [14:12:28] I was talking with geh.el yesterday that he got a WARNING from the reimage script of some hosts not getting back into normal after a PXE reboot [14:12:47] and we were wondering if the reimage script should run bootdev none anyway [14:12:50] just in case [14:16:42] volans: maybe do that if Boot Device Selector != "No override"? [14:16:49] I think ipmi can only set the one-shot value, not the persistent bios setting for all future reboots? 
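Jumping back to the UUID thread for a moment: the "32-bits hashed nodeid" idea above could be derived with something as small as an FNV-1a hash over the hostname, optionally mixing in the pid as suggested a bit later in the afternoon. A sketch only, under those assumptions; `fnv1a32` and the choice of inputs are not anything decided here.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* FNV-1a over an arbitrary buffer, folding into a running 32-bit hash. */
static uint32_t fnv1a32(const void *data, size_t len, uint32_t hash)
{
    const unsigned char *p = data;
    for (size_t i = 0; i < len; i++) {
        hash ^= p[i];
        hash *= 16777619U;   /* FNV prime */
    }
    return hash;
}

int main(void)
{
    char host[256] = "";
    gethostname(host, sizeof(host) - 1);
    pid_t pid = getpid();

    uint32_t h = 2166136261U;                 /* FNV offset basis        */
    h = fnv1a32(host, strlen(host), h);       /* hash the hostname       */
    h = fnv1a32(&pid, sizeof(pid), h);        /* mix in the pid as well  */

    printf("nodeid for %s/%d: 0x%08x\n", host, (int)pid, (unsigned)h);
    return 0;
}
```

The resulting 32-bit value would simply be dropped into whatever slice of the node/clock bits ends up reserved for the node id.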
[14:17:04] the problem in these cases usually is the bios setting for all future reboots is set to pxe-first [14:17:20] bblack: there is persistent within the options [14:17:45] so those are 2 different things [14:18:08] if the reimage script raises a WARNING means that there is some override [14:18:20] and this is unrelated with the default boot order IIRC [14:18:44] ok [14:19:29] but will need to re-check to be 100% sure [14:19:53] volans: ok, so now after d-i I should check the bios boot order I guess? [14:20:16] are you using the reimage script? [14:20:19] yes [14:20:46] if they are not the default you should get: [14:20:46] WARNING: unable to verify that BIOS boot parameters are back to normal [14:21:02] with what bootparam get 5 returns [14:21:51] I didn't get that for cp2022, yet it rebooted w/ d-i after d-i [14:22:07] trying again now if you wanna follow along [14:22:21] /var/log/wmf-auto-reimage/201808081414_ema_28782_cp2022_codfw_wmnet.log /var/log/wmf-auto-reimage/201808081414_ema_28782_cp2022_codfw_wmnet_cumin.out on neodymium [14:22:22] the script checks that after the first puppet run [14:22:29] ack following along [14:22:39] neo or sarin? [14:22:42] neo [14:22:49] bad boy :-P [14:23:04] so inefficient :-P [14:23:26] I like it to fail slowly [14:24:22] installing grub now [14:24:22] I still always use neo because muscle memory. sarin is better for all the same purposes lately? [14:24:57] it's the same but is in codfw, so if you do stuff in codfw or closer DCs is quicker ;) [14:25:05] oh right codfw [14:25:59] ema: so cp2022 right now has Boot parameter data: 8000020000 [14:26:08] the diff compared to a normal one is [14:26:18] - - Boot Flag Invalid [14:26:19] + - Boot Flag Valid [14:26:30] and another on BIOS verbosity [14:27:00] the "default" one is Invalid, if you were wondering ;) [14:27:12] Invalid is right and Valid is wrong, of course [14:28:17] booted into d-i again, powering down [14:29:42] maybe this one require to have a look at the bios settings in the mgmt console [14:30:08] :) [14:30:23] because IPMI says [14:30:24] - Boot Device Selector : No override [14:33:32] yeah, boot sequence was: 'integrated nic', 'hard drive C:' [14:33:35] lol @ C: [14:36:38] all these years and unix still hasn't caught up with microsoft's drive-letter feature! [14:37:17] so many months wasted rambling about "persistent names" and so on [14:37:21] just always call it C! [14:37:50] C:\ is the most persistent pathname in the history of computing! [14:38:08] how predictable is that! [14:38:38] after a chat with the bios, cp2022 now booted from disk [14:39:15] 14:38:59 | cp2022.codfw.wmnet | Started first puppet run (sit back, relax, and enjoy the wait) [14:39:18] I shall! [14:39:40] https://en.wikipedia.org/wiki/Drive_letter_assignment#Origin [14:44:14] I still miss eth0 :) [14:45:32] // 16 Taps: 32 31 28 27 23 20 19 15 12 11 10 8 7 4 3 2 = 0xCC4C4ECE [14:45:36] #define _GET_LFSR(_x) (_x = (_x >> 1) ^ (uint32_t)((0-(_x & 1U)) & 0xCC4C4ECEU)); [14:45:49] ^ just drug this out of ancient pre-git gdnsd code history [14:46:00] interview question! 
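For reference, the macro just pasted, wrapped into a minimal compilable program and seeded from /dev/urandom as discussed below. The wrapper (`seed_from_urandom`, the demo loop) is invented for illustration and is not the eventual varnish/ATS integration. One caveat: the all-zero state maps to itself, so the seed must be non-zero, and the cycle then runs through the 2^32 - 1 non-zero values.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* 16 Taps: 32 31 28 27 23 20 19 15 12 11 10 8 7 4 3 2 = 0xCC4C4ECE */
#define _GET_LFSR(_x) (_x = (_x >> 1) ^ (uint32_t)((0 - (_x & 1U)) & 0xCC4C4ECEU))

static uint32_t seed_from_urandom(void)
{
    uint32_t s = 0;
    FILE *f = fopen("/dev/urandom", "rb");
    if (!f || fread(&s, sizeof(s), 1, f) != 1) {
        perror("urandom");
        exit(1);
    }
    fclose(f);
    /* Zero is a fixed point of the LFSR, so never seed with it. */
    return s ? s : 1;
}

int main(void)
{
    uint32_t state = seed_from_urandom();
    for (int i = 0; i < 5; i++) {
        _GET_LFSR(state);                 /* advance the Galois LFSR one step */
        printf("%08x\n", (unsigned)state);
    }
    return 0;
}
```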
efficient 32-bit LFSR with maximal period, to do an efficient inline-C randomness thing for UUIDs in varnish [14:47:02] we'll just have to seed the uint32_t starting point from /dev/urandom and then _GET_LFSR() will feed us a pseudo-random sequence of 32-bit numbers [14:47:28] (it's a predictable sequence, but it only rolls back to the start after exhausting the whole set of 32-bit integers) [14:49:00] with a 100ns timestamp and some kind of unique node id, should be more than plenty [14:49:43] (maybe node id should hash in the thread's pid too, since it will probably be a per-thread generator) [14:51:33] I'm not sure where I grabbed the constants from back then, but there's a number of google hits like http://www.onarm.com/forum/3202/ which reference the same one [14:53:57] probably the usual explanation: somebody let their cat run across their (hex) keyboard :P [14:53:59] nice find though [14:57:38] 10netops, 10Operations: Rack/setup cr2-eqdfw - https://phabricator.wikimedia.org/T196941 (10Papaul) @ayounsi all SFP+-10G-LR are in place . [15:00:06] there's a bunch of alternatives, could randomly choose which tap-set to use [15:00:15] (per-thread at startup I mean, for more diversity) [15:02:25] godog: I lied, there's `traffic_server -C verify_config` we can use :) [15:02:53] ema: fantastic, that was a very fast implementation [15:04:45] godog: 10/10 would do business again [15:08:26] hehehe [15:08:29] AAAAAA+++++ [15:09:27] unlike +++ATH0 [15:18:36] 10Traffic, 10netops, 10Operations: Use dns100[12] as ntp servers in eqiad networking equipment - https://phabricator.wikimedia.org/T201414 (10ayounsi) Network devices updated. [15:19:46] ugh [15:20:01] thanks puppet for not having stacks of disable messages :P [15:20:02] 10Traffic, 10netops, 10Operations: Use dns100[12] as ntp servers in eqiad networking equipment - https://phabricator.wikimedia.org/T201414 (10Vgutierrez) Awesome, thanks! [15:20:29] ema: some of our disables overlap, but I think it won't matter shortly, or I can fix it [15:20:37] 10Traffic, 10DNS, 10Operations, 10Patch-For-Review: rack/setup/install dns100[12].wikimedia.org - https://phabricator.wikimedia.org/T196691 (10Vgutierrez) [15:20:40] 10Traffic, 10netops, 10Operations: Use dns100[12] as ntp servers in eqiad networking equipment - https://phabricator.wikimedia.org/T201414 (10Vgutierrez) 05Open>03Resolved [15:21:00] bblack: want me to stop the reboots? [15:21:04] no reboots are fine [15:21:28] let me run through the agent_disabled contents and figure out where the intersection is, but the only ones I'm touching are the to-be-decommed cp10 [15:21:33] (or spared or whatever) [15:22:17] I'm guessing you basically put the numa_networking disable on every cache that didn't have numa_networking and is due for a kernel reboot? [15:22:26] yep! [15:23:16] ok I'm stealing 14 of your disables and changing the message to mine [15:23:24] cp[1049-1050,1053-1055,1062,1064,1066-1068,1071-1073,1099].eqiad.wmnet [15:23:38] sounds good!
[15:23:41] (the various cp10 about to be reimaged to either spare or test roles) [15:25:21] err and 1063 which somehow got un-disabled [15:26:03] that one was re-enabled by the reboot script [15:26:46] ah [15:26:50] don't reboot those I guess :) [15:27:04] (the only ones that should need it are eqiad cache_misc) [15:27:20] (within eqiad, I mean) [15:27:30] right, I was going through the list now [15:27:53] and 3 of those are done [15:28:13] just cp1045 is left to do in eqiad (needs kernel update, not about to be reimaged for decom/spare) [15:28:33] err wait I got that backwards heh [15:28:51] cp[1051,1058,1061].eqiad.wmnet left to do in eqiad, cp1045 is the only misc there already done [15:29:44] not numa_networking yet for cp1045 though, so we might as well reboot it too [15:29:49] ok [15:30:12] either way, I'm not touching those 4, but I'm about to reimage everything else in eqiad that isn't brand new [15:30:44] alright, I have a long list of non-eqiad nodes to reboot :) [15:32:14] I'm hoping mine will go smoothly, since they're all going to role(spare::system) or role(test), we'll see! [15:41:36] 10Traffic, 10Operations, 10ops-eqiad: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10Vgutierrez) [16:00:49] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` hydrogen.wikimedia.org ``` The log can be found in `/var... [16:02:00] 10Traffic, 10Operations, 10ops-eqiad, 10Patch-For-Review: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by vgutierrez on neodymium.eqiad.wmnet for hosts: ``` chromium.wikimedia.org ``` The log can be found in `/var... [16:03:56] bblack: regarding lvs2009 and lvs2010, do you want to proceed with them or shall we wait till all of them are available? T196560 [16:03:57] T196560: rack/setup/install LVS200[7-10] - https://phabricator.wikimedia.org/T196560 [16:10:46] vgutierrez: what state are they in now? 
[16:12:18] bblack: imaged as spare systems I think [16:13:02] hmmm actually not even that, installed and outside puppet [16:13:13] ok [16:13:35] at least they should be imaged as spare systems IMHO [16:13:50] right, either that or we put the right role on them right now [16:14:22] so we have 9 and 10, which are intended to be the low-traffic primary and the universal backup in the new layout [16:15:19] we don't even have the interface IPs mapped out and DNSed and such I don't think, so maybe put them in site.pp as spares for now and then we can sort all that out before reimaging them [16:16:20] ack [16:21:09] vgutierrez: fyi, their switch ports show as down: https://librenms.wikimedia.org/ports/state=down/location=CyrusOne%2C+Carrollton%2C+Texas%2C+USA/format=list_basic/ [16:21:30] ideally that list should be empty so we can properly start alerting if something shows up [16:22:28] XioNoX: not all of them [16:22:35] nic1 is up for both of them [16:22:46] that's expected while lvs puppet role doesn't kick in [16:22:53] yeah, I mean then ones listed [16:23:05] BTW, that naming schema is far from being ideal [16:23:08] ok [16:23:32] vgutierrez: yeah, naming is hard [16:23:57] we don't know the exact name of the interface until the host is ready [16:24:32] so nicX is temporary [16:24:51] ack [16:26:03] this all makes me want to move to tunnel-based LVS so they don't have 4x interfaces in the first place :) [16:26:32] YES! [16:27:03] and it *should* be pretty straightforward to setup [16:27:26] probably not until we at least get to active/active with the routers hashing ECMP to them though, or the 1x10G for an active traffic class might be problematic in edge/attack cases [16:27:34] unrelated, bblack, is the issue discussed in T196477 similar to what you went through for the new CPs ? [16:27:35] T196477: rack/setup/install backup2001 - https://phabricator.wikimedia.org/T196477 [16:27:47] and then there's the MTU issues too, we'd need >1500 MTUs so the tunneling doesn't refragment too [16:28:53] XioNoX: yes, we should really wrap all this up in some meta-ticket [16:29:33] XioNoX: the common TL;DR is "jessie 3.16 installer kernel doesn't have drivers for latest-gen Dell hardware we're installing in general. Please accelerate your stretch conversions because other options are Kinda Hard" [16:33:16] XioNoX: but it looks like in your particular ticket, moritz is offering some possible alternative path that involves using the jessie-backports kernel just for the d-i, I think? I donno how hard that is either or what other complexities would arise [16:34:24] I only pointed out in case some efforts were being duplicated, I don't know the details [16:34:32] (as it looked similar) [16:35:35] ah [16:35:55] robh: BTW, https://phabricator.wikimedia.org/T201522 this is enough? 
[16:35:59] well for our traffic stuff, the cp servers were the last types we hadn't stretched anyways, and now we're stretching them [16:36:02] others may be more-stuck [16:41:40] vgutierrez: there is a checklist at the start of the #decom landing page [16:41:47] and all decoms need #decom tagged [16:41:51] or i wont know about it [16:42:10] 10Traffic, 10Operations, 10decommission, 10ops-eqiad, 10Patch-For-Review: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10RobH) [16:42:21] https://phabricator.wikimedia.org/project/profile/3364/ has the checklist ill paste in for each system [16:43:25] task updated =] [16:43:28] 10Traffic, 10Operations, 10decommission, 10ops-eqiad, 10Patch-For-Review: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10RobH) [16:44:11] vgutierrez: so the checklist there now has a few first steps (like confirming offline and remiving from lvs config) that you should likely confirm/check off [16:44:24] but then the rest (particularly everything after non-interrupt steps) either i handle or onsites [16:45:28] robh: right.. those systems are being reimaged as we speak as spare systems [16:45:36] ? [16:45:37] so that's granted :) [16:45:38] why? [16:45:46] why not? [16:45:53] becasue they are being decommissioned [16:45:57] so why waste time reimaging? [16:46:02] they'll be wiped and unracked. [16:46:04] they're not being used anymore, they've public IPs [16:46:13] so set to role spare? [16:46:24] and then i can take them offline right away afterwards or can they not do that? [16:46:33] while they're not being wiped and unracked it's useful to have them tracked with the mininum attack surface as possible :) [16:46:42] role spare would do that right? [16:46:46] yup [16:46:52] so i dont understand why reimage [16:46:57] why not just change to role spare and run puppet? [16:47:03] it's our normal workflow, to avoid timing dependencies [16:47:15] and to ensure it's really gone from all possible prod interaction [16:47:16] oh, ok, seems odd to me and extra work but its your time not mine =] [16:47:21] well... extra work [16:47:24] most folks dont reimage before decom, they set to role spare [16:47:24] reimage to role::spare, then hand off for the real decom steps [16:47:29] and then assign to me but meh, either is fine! [16:47:34] just spawning the lovely wmf-reimage-host :) [16:47:38] it's pretty easy [16:47:50] was just trying to udnersand, i suppose just firing off the script to reimage is easier than hoping a puppet run to role spare works for you [16:47:54] so it works for me =] [16:48:20] I just did it to a crapload of eqiad caches too, but I'm not filing for decom just yet (because technically, we're still early days on the new hosts and there could be an emergency reason to reimage a few right back into service) [16:48:43] no worries, whenever its ready for decom just file the task in #decom. 
know we'll be applying that checklist on a per host basis [16:48:49] since there are per host checks and removals to do [16:49:07] so feel free to paste said checklist in the task and check off anything already done =] [16:49:15] I reimaged 23 of them, 18 of which are true decom and 5 are being kept in a spare role temporarily [16:49:48] ie: one task for all of the single role systems (like those cps) is fine [16:49:54] 4 more decoms probably coming by early next week at the latest as well, so there will be a total of 22 old eqiad cps ready to come out of racks by then -ish [16:51:34] oh right, I actually didn't put the 5 spares on my list heh, need to kick those off right quick [17:05:33] 10Traffic, 10Operations, 10decommission, 10ops-eqiad, 10Patch-For-Review: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['chromium.wikimedia.org'] ``` and were **ALL** successful. [17:05:51] 10Traffic, 10Operations, 10decommission, 10ops-eqiad, 10Patch-For-Review: Decommission chromium and hydrogen - https://phabricator.wikimedia.org/T201522 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['hydrogen.wikimedia.org'] ``` and were **ALL** successful. [17:11:30] bblack: stopping the numa_networking reboots for today, will continue tomorrow [17:12:29] ok [17:12:34] leave them disabled as-is right? [17:12:40] yup! [17:12:44] ok [17:12:56] I put all the ipsec in downtime through tomorrow around now [17:13:04] perfect [17:13:19] basically the puppet-disabled ones are all failing ipsec to the hosts I spared/tested-out, for lack of the puppetization of the shorter ipsec node lists [17:13:29] but they'll fix as they reboot or whatever [17:15:16] ok, cya! [17:16:43] cya! [17:17:00] all that list is now reimaged too, although the final state of role(test) is kind of a mystery to me [17:17:27] 1071-4,99 are in role(test), rest of the non-misc legacy eqiad nodes are in role(spare::server) [17:19:42] vgutierrez: the mystery of the tls stats remains mysterious. It's been 7d + ~40 minutes since the FS-only merge, but even "lsat 5 minutes" stats still show RSA key exchanges at ~0.035% [17:20:00] yup [17:20:02] I'm aware [17:20:02] I would've guessed somewhere was doing a 1w rolling average [17:20:13] but maybe it's longer, I donno, or happening in multiple layers :) [17:20:25] maybe godog can help us figuring it out [17:20:34] yeah maybe [17:23:55] for sure, I can't right now and I'll be off tomorrow/fri, next week tho! [17:28:56] vgutierrez, hi [17:29:01] Krenair: hey :D [17:29:11] vgutierrez, was it necessary to make these changes all in the same commit? [17:29:30] I can split it if you want [17:29:51] but I wanted to add tests, and it required some refactoring [17:30:07] that didn't really need to be done in that commit either [17:31:20] and although I can clarify comments it's not really related to getting that function into a separate file [17:32:29] I may or may not have time to review this tonight [17:33:16] ack [17:46:31] 10netops, 10Operations, 10ops-eqiad: asw2-a-eqiad VC link down - https://phabricator.wikimedia.org/T201095 (10Cmjohnson) [17:48:52] 10netops, 10Operations: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10ayounsi) asw2-a-eqiad now looks like the 3rd diagram (all leafs have at least 1 link to a spine). fpc4 is connected to fpc2/fpc6/fpc7 (removed fpc3 links) fpc5 is connected to fpc2/fpc3/... 
[17:50:56] 10netops, 10Operations, 10monitoring: Add virtual chassis port status alerting - https://phabricator.wikimedia.org/T201097 (10ayounsi) [17:52:07] 10netops, 10Operations: connectivity issues between several hosts on asw2-b-eqiad - https://phabricator.wikimedia.org/T201039 (10ayounsi) Disabled the VC link between fpc4 and fpc5 to reduce the density of links (cf. T201145#4486602). [18:08:41] 10netops, 10Operations, 10ops-eqiad: asw2-a-eqiad VC link down - https://phabricator.wikimedia.org/T201095 (10Cmjohnson) Received the new cables and swapped fpc1-fpc3 [18:51:05] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10ArielGlenn) If you are going to store things on the dump web servers: You want f... [19:19:23] 10Traffic, 10Operations, 10Performance-Team, 10Wikimedia-General-or-Unknown, 10SEO: Search engines continue to link to JS-redirect destination after Wikipedia copyright protest - https://phabricator.wikimedia.org/T199252 (10Nemo_bis) >>! In T199252#4489377, @ArielGlenn wrote: > You want files to go under... [20:32:50] 10netops, 10Operations: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10ayounsi) The above changes led to a malfunction of asw2-a-eqiad starting at 17:45 UTC causing: ~35% packet loss to hosts in row A, this also impacted hosts on asw for traffic coming from... [20:36:35] 10Traffic, 10Core-Platform-Team, 10Operations, 10Performance-Team, and 4 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Catrope) >>! In T201409#4484469, @Krinkle wrote: > Added see also: {T193050} and {T147101} In addition to that, do we ha... [20:43:15] https://githubengineering.com/glb-director-open-source-load-balancer/ [23:55:00] 10netops, 10Operations, 10Wikimedia-Incident: asw2-a-eqiad FPC5 gets disconnected every 10 minutes - https://phabricator.wikimedia.org/T201145 (10ayounsi)