[08:09:30] volans: is there an accepted approach to adding unit tests to python files in the puppet repo?
[08:12:02] kormat: I'm not yet around but the quick answer is take the code out of puppet into its own project :-P that said we have a few examples, none of which are great IMHO, can link them in a bit
[08:16:03] klausman: remember that issue with os.sendfile() and sys.stdout? it's stopped happening on my machine. w.t.f.
[08:18:21] i haven't rebooted, i haven't updated python since before i ran into this issue
[08:21:00] huh. some of my terminals have stdout opened with O_APPEND, and some don't.
[08:21:34] yeah, that determines it. i've no idea why the difference.
[08:23:05] https://phabricator.wikimedia.org/P12562#70292 ssh changes the flags...
[08:27:46] weeeird
[08:28:52] But it does something similar here
[08:29:00] Before: flags: 0100002
[08:29:06] After: flags: 0102002
[08:30:31] it only seems to happen about 50% of the time for me
[08:31:29] Hang on. This might not be SSH
[08:31:58] I just tried producing the change 4-5 times in a row without success. Then I used *tab completion* once, and it appeared
[08:32:29] yesss. that's it.
[08:32:37] though not on, e.g. ls.
[08:32:57] so maybe the host tab completion does something funky
[08:33:07] Havin' a look-see
[08:33:25] ping cum also does it
[08:33:40] I suspect it's whatever looks in known_hosts
[08:37:34] tmpkh=($(awk 'sub("^[ \t]*([Gg][Ll][Oo][Bb][Aa][Ll]|[Uu][Ss][Ee][Rr])[Kk][Nn][Oo][Ww][Nn][Hh][Oo][Ss][Tt][Ss][Ff][Ii][Ll][Ee][ \t]+", "") { print $0 }' "${config[@]}" | sort -u))
[08:37:36] Lovely.
[08:40:22] <_joe_> my eyes!!!!
[08:41:03] <_joe_> kormat: look at what was done for mcrouter's certificate verification script, I would say
[08:46:51] _joe_: ack, will do.
[08:58:38] kormat: fwiw, by "breaking" my terminal using tabcomp on a hostname, I can make your python script fail as well
[08:58:43] omg
[08:59:06] klausman: good good :)
[08:59:36] * apergos checks right back out of this channel
[08:59:41] But I have been entirely unable to find anything in the bashcomp function used that would explain this
[09:02:12] Now installing zsh to see if it has the same behavior
[09:03:24] <_joe_> ahah this is golden
[09:03:41] <_joe_> "python script broken by tab completion"
[09:03:49] :D
[09:04:45] zsh does *not* do this
[09:05:59] This might be a *readline* issue, not bash
[09:07:07] <_joe_> would make sense
[09:07:22] At any rate: WONTFIX UPSTREAM :-P
[09:07:49] <_joe_> definitely
[09:08:12] <_joe_> but I think kormat can verify the stdin flags and set them appropriately in her script
[09:08:49] _joe_: i could, but what i've done is just open /dev/stdout and use that
[09:09:21] rather than modifying the terminal flags, which seems naughty
[09:09:41] Well, bash/readline futzing with them like it does is naughty as well
[09:09:51] klausman: indeed
[09:10:08] But I do get it if you don't have the energy to pester upstream about it
[09:10:31] figuring out which upstream is already non-trivial :)
[09:12:00] <_joe_> yes
[09:12:56] https://sourceware.org/bugzilla/show_bug.cgi?id=14292 One of my fave bugs I ever reported. Bug in glibc starting in 2001, I found it in 2012, it got fixed in 2015
[09:13:17] (and the fix was literally a 1-character change)
[09:16:02] klausman: it's awk
[09:16:19] Oh. So the line up there even is the culprit!
[09:16:28] yep
[09:17:21] running `awk '' /etc/ssh/ssh_config` is enough
[09:18:31] ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
[09:18:43] Terminals. They will be the end of us all.
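For reference, the two flag words quoted above (0100002 before, 0102002 after) differ only in the 02000 bit, which is O_APPEND on Linux, and sendfile(2) documents EINVAL when the output descriptor has O_APPEND set. What follows is a minimal sketch, assuming Linux and not taken from kormat's script, of checking that bit and of the workaround mentioned above (opening /dev/stdout afresh instead of changing the terminal's flags). The helper names `has_append`, `stdout_flags` and `fresh_stdout_fd` are hypothetical.

```python
import fcntl
import os
import sys

# Value of O_APPEND on Linux; used only to decode the flag words quoted in the log.
LINUX_O_APPEND = 0o2000


def has_append(flags, append_bit=os.O_APPEND):
    """Return True if the O_APPEND bit is set in a file status flag word."""
    return bool(flags & append_bit)


def stdout_flags():
    """Read the file status flags of stdout via fcntl(F_GETFL)."""
    return fcntl.fcntl(sys.stdout.fileno(), fcntl.F_GETFL)


def fresh_stdout_fd():
    """Open /dev/stdout anew (assumption: on Linux this yields a separate
    open file description, so the shared terminal flags are left alone)."""
    return os.open("/dev/stdout", os.O_WRONLY)
```

A script could call `has_append(stdout_flags())` before using os.sendfile() and fall back to `fresh_stdout_fd()` as the output descriptor when it returns True.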
[09:22:32] https://git.savannah.gnu.org/cgit/gawk.git/tree/io.c?h=gawk-5.0.1&id=ef83f3a14e5d3d2c446bdbdb801ba14345fc27ed#n983
[09:24:10] Why, tho?
[09:25:14] kormat: I came back to elaborate on the above question, but I see you're doing a deep dive into much more important problems :-P
[09:25:52] :D
[09:35:30] kormat: so we have a way to run per-module tox on puppet
[09:35:43] and so yes it's possible, but it's a bit of a mess
[09:36:00] and if a project grows more than one file it should really go into its own repo
[09:36:32] volans: the example i'm thinking of is my binary-search script. it should only be a single file, but it would be nice to have some tests
[09:36:42] and having a separate repo for it + packaging seems like way overkill
[09:36:48] indeed
[09:37:59] if you look at rake_modules/taskgen.rb (search for tox_ini)
[09:38:03] you can see what it does
[09:38:41] is your file in a module?
[09:38:50] or in profile
[09:39:08] so far neither
[09:39:44] i had originally started adding it to operations/software.git, but then realised that deploying from there is a nightmare
[09:50:08] yeah
[09:51:28] i guess making a separate puppet module for this makes sense. it's a fairly general script, so it could be useful elsewhere in the future
[09:52:22] <_joe_> +1
[12:50:38] what's up with restbase?
[12:50:55] it's restless(base)
[12:51:00] I was asking in -serviceops
[12:51:11] and the wikitech page is empty
[12:51:25] Did pdu work start?
[12:51:36] no idea
[12:51:45] didn't see any chat about them
[12:51:46] yet
[12:52:55] also why are we trying to get 'aggregated feed content for April 29, 2016'?
[12:53:13] C4 and C5 don't list any restbase server
[12:53:13] a call to https://en.wikipedia.org/api/rest_v1/feed/featured/2020/09/15 seems to return some results
[12:53:21] (the ones scheduled for today)
[12:53:29] so I am doing a roll restart of cassandra on aqs, and I see from the restbase grafana page that some 50x are returned
[12:53:32] i think this happened a month ago and there was an issue with the alert fetching the wrong page
[12:53:34] https://grafana.wikimedia.org/d/000000068/restbase?orgId=1&from=now-3h&to=now
[12:53:48] I am not sure what feed featured is but maybe it uses aqs
[12:54:02] checking cassandra on aqs
[12:54:08] (the cookbook is running)
[12:54:27] moritzm: today it is: C4, C5, C2 and C3
[12:54:33] recoveries coming in now
[12:54:48] I'm seeing some indexing errors for logstash, that's aqs error logs for 'operation time out' talking to cassandra FYI
[12:55:05] I'll follow up in a task cc elukey
[12:55:21] yes I think it is my roll restart, but not sure why
[12:55:25] marostegui: ah, yes, but also without restbase* in C2/C3
[12:55:35] cool
[12:57:38] kormat: I think it would be reasonable to make `doctest` work for Python in the puppet repo, fwiw
[12:58:20] godog: I am trying to understand if wikifeeds uses aqs behind the scenes, if so the alarms are explained
[12:58:40] but it is not great
[12:59:08] (it would be not great that aqs not responding temporarily causes a storm of health check failures)
[13:00:11] cdanis: so reasonable that you'll send a CR? :)
[13:00:20] ah snap alarms again
[13:00:31] kormat: I don't know where you get such ideas
[13:01:07] elukey: not sure if related or not, the alerts I'm talking about are for malformed logs which can't be properly indexed, opened https://phabricator.wikimedia.org/T262920
[13:01:10] cdanis: i should report myself to sobanski for being overly-reasonable
[13:01:32] godog: ah ok sorry I thought you were referring to the storm of restbase alerts
[13:02:33] kormat: unacceptable
[13:02:42] elukey: np! but yeah maybe related, no idea atm
[13:02:44] /o\
[13:04:22] <_joe_> elukey: so wait, wikifeeds calls aqs?
[13:04:34] <_joe_> why is that not in the config for wikifeeds?
[13:04:40] <_joe_> oh because it calls it via restbase
[13:04:45] nono I have no idea, I am only saying that it matches
[13:04:46] * _joe_ cries in a corner
[13:04:52] elukey: I don't think it does...
[13:05:05] or... maybe I am wrong, let me make sure
[13:05:10] see https://grafana.wikimedia.org/d/000000068/restbase?orgId=1&from=now-3h&to=now
[13:05:32] it's the "most read" endpoint that is complaining
[13:05:38] so this is starting to sound more plausible
[13:05:42] yeah
[13:06:16] I checked the spicerack logs and there are only two nodes left/in-progress to do, 2 instances each, the cookbook should finish soon-ish
[13:06:47] maybe the defaults for the cookbook are a little aggressive for aqs
[13:06:51] will investigate
[13:06:53] <_joe_> yes
[13:06:54] this is /v1/page/most-read
[13:07:05] so looks like AQS indeed
[13:07:14] <_joe_> I think restbase calls wikifeeds that calls restbase that calls aqs
[13:07:14] making sure in restbase code/configuration
[13:07:16] <_joe_> ok
[13:07:25] <_joe_> akosiaris: wikifeeds has the same issue btw
[13:07:44] damn I hate the interconnected mess that is restbase
[13:07:53] <_joe_> tell me about that
[13:08:35] I know I will regret asking, but why does the icinga alert report 'aggregated feed content for April 29, 2016'?
[13:09:03] it's probably the x-amples part of the spec
[13:09:08] that date has been chosen
[13:09:09] <_joe_> volans: it's the description of the test in the spec
[13:09:12] _joe_: here, maybe this will help you feel better -- sound on 🔊 https://twitter.com/hitsugichan/status/1305700778695327744
[13:09:38] ack
[13:09:57] would be nice to add some info in https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase or a link to any related existing docs fwiw
[13:10:47] <_joe_> I didn't even know that page existed
[13:11:12] I think those are the auto-generated ones for generic alerts
[13:11:15] <_joe_> and indeed
[13:11:18] IIRC Daniel worked on that
[13:11:24] don't recall all the details sorry
[13:11:26] cdanis: I am playing it in a loop
[13:11:28] <_joe_> so yes, tell him.
[13:11:30] if (req.params.domain === 'fy.wikipedia.org') {
[13:11:30] return BBPromise.resolve({ meta: {} });
[13:11:30] }
[13:11:36] elukey: <3
[13:11:38] I am not sure I even want to know...
[13:12:47] <_joe_> elukey: so once the aqs reboot is over, please let us know
[13:12:54] <_joe_> so we can check if the issues continue
[13:13:05] yeah it should finish soon, I am tailing logs
[13:13:13] should be down to the last node
[13:13:24] [DEBUG clustershell.py:254 in ev_pickup] node=aqs1009.eqiad.wmnet, command='c-foreach-restart -d 10 -a 20 -r 12'
[13:14:04] I've just verified that it's indeed AQS, wikifeeds wants to call the pageviews API
[13:14:10] which is indeed AQS
[13:15:46] <_joe_> now, why can't wikifeeds call aqs directly?
[13:15:54] <_joe_> given aqs is basically restbase as well?
[13:17:34] cause "it's the API"
[13:18:22] cookbook done
[13:21:13] <_joe_> elukey: https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?viewPanel=6&orgId=1&var-datasource=codfw%20prometheus%2Fops&var-origin=restbase&var-origin_instance=All&var-destination=search-https_codfw&var-destination=wikifeeds
[13:21:28] <_joe_> I am so happy we have this kind of visibility now
[13:22:13] nice
[13:25:21] hello! am talking about stream processing stuff with some search team folks today
[13:25:37] i've mentioned potential SRE use cases before.
[13:25:40] q for all:
[13:25:42] <_joe_> ottomata: define "stream processing"
[13:25:45] ok
[13:25:50] elukey: also https://grafana.wikimedia.org/d/35vIuGpZk/wikifeeds?viewPanel=20&orgId=1&var-dc=codfw%20prometheus%2Fk8s&var-service=wikifeeds
[13:26:11] thanks!
[13:26:13] <_joe_> ottomata: if you mean having an infra that reacts to events, I think we should go with something native to k8s.
[13:26:16] https://en.wikipedia.org/wiki/Complex_event_processing
[13:26:34] and or
[13:26:34] https://en.wikipedia.org/wiki/Event_stream_processing
[13:26:38] <_joe_> an aws-lambda like model
[13:27:04] but more generally: if SRE could easily run SQL on streams in kafka, would that be useful?
[13:27:12] for things like DDoS detection or the like?
[13:27:17] <_joe_> yes
[13:27:31] <_joe_> oh so you mean real-time processing of data in a stream
[13:27:35] _joe_: def +1 to k8s stuff
[13:27:37] <_joe_> so flink
[13:27:58] flink or kafka streams or something, but probably flink given licensing issues with KSQL
[13:28:05] <_joe_> ottomata: for most stuff that's not analytics, I think something like knative or kubeless is probably the direction we want to go in
[13:28:40] <_joe_> if the pattern is "watch for events, react to the event" then something like that should be our choice
[13:28:50] <_joe_> or in the meantime our dear-old changeprop
[13:29:00] :)
[13:29:39] _joe_: that sounds like just simple triggering on events though, right?
[13:29:48] <_joe_> more or less, yes
[13:29:58] not something like: join these 3 streams on this key over rolling 10-second windows and look for this pattern and then alert
[13:30:03] <_joe_> not just triggering, ofc, but yes it's not stream /analysis/
[13:30:13] <_joe_> yeah def not
[13:30:32] is ^^ something that could be useful to SRE, if it was not painful to do?
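To make the "join streams on a key over a time window, then alert" idea above concrete, here is a rough batch sketch in Python. It is not Flink or Kafka Streams code, and it simplifies in two ways that are assumptions of this example: it joins two streams rather than three, and it uses fixed (tumbling) windows rather than rolling ones. The function name `windowed_join_alerts` and the `(timestamp, key)` event shape are hypothetical.

```python
from collections import Counter


def windowed_join_alerts(stream_a, stream_b, window_seconds, threshold):
    """Join two (timestamp, key) event streams on key within fixed windows.

    Returns a sorted list of (window_start, key, count_a, count_b) for keys
    seen in both streams in the same window whose combined count exceeds
    the threshold.
    """
    def bucket(events):
        # Count events per (window_start, key) pair.
        counts = Counter()
        for ts, key in events:
            window_start = int(ts // window_seconds) * window_seconds
            counts[(window_start, key)] += 1
        return counts

    a_counts = bucket(stream_a)
    b_counts = bucket(stream_b)
    alerts = []
    for (window_start, key), count_a in a_counts.items():
        count_b = b_counts.get((window_start, key), 0)
        # Inner join: the key must appear in both streams in this window.
        if count_b and count_a + count_b > threshold:
            alerts.append((window_start, key, count_a, count_b))
    return sorted(alerts)
```

A real streaming engine would evaluate this incrementally as events arrive instead of over complete lists, which is the part that makes it usable during an outage.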
[13:30:34] <_joe_> ok thanks I wanted to understand the boundaries
[13:30:43] <_joe_> it seems interesting to me
[13:30:45] <_joe_> cdanis: ^^
[13:30:57] yes, I am definitely interested
[13:31:19] <_joe_> ottomata: add us!
[13:31:20] I am interested as well. As long as it's usable during an outage it could prove useful
[13:31:25] _joe_: to meeting?
[13:34:49] would something like that also be useful for anything other than alerting?
[13:35:19] FYI c3 one side going down now
[13:35:25] I mean C2
[13:36:42] ottomata: potentially, yeah
[13:36:45] <_joe_> ottomata: I guess at some point yes
[13:37:50] ok, for concrete use case stuff, for the former example of complex stream joining windowing alerting during an outage
[13:38:15] what's a more concrete way of stating that use case?
[13:38:30] like; during an outage suspected to be caused by a ddos
[13:39:53] it'd be nice to be able to write streaming sql that joins the webrequest stream with the netflow data by IP over a time window and emits events where count is greater than N ?
[13:40:06] (dunno if that is real, just made it up)
[13:40:47] it's odd to see both "nice" and "sql" in the same sentence :)
[13:42:26] <_joe_> kormat: try sparql
[13:42:37] <_joe_> ottomata: we have several use cases
[13:42:48] <_joe_> one could be outlier detection and adaptive rate-limiting
[13:43:01] _joe_: 😮
[13:43:11] <_joe_> kormat: or wait, try graphql
[13:46:17] hmm adaptive rate limiting interesting
[13:46:20] ok thanks all!
[14:18:44] pulling power to C3 now
[14:21:28] cmjohnson1: c2 is fully done?
[14:22:08] marostegui c2 power is fully restored, the pdu linking and setup is not done but that will not affect power
[14:22:57] cmjohnson1: excellent, thank you. Going to restart the services we have there. If you have a chance, db1088 is showing with one power supply as disconnected
[14:25:59] ms-be1024 and kafka-jumbo1004 too, same alerts
[14:28:50] _joe_: any preferences on naming for this binary-search-of-ordered-logfiles script? i'm thinking `bsection`
[14:29:06] <_joe_> csection sounds better
[14:29:08] <_joe_> but ok
[14:29:43] 🐝section
[14:29:44] :)
[14:32:54] <_joe_> kormat: is your tool just selecting timeranges or is it a more generalized binary search tool?
[14:33:19] _joe_: you give it an (ordered) prefix, it will output all lines that start with that prefix
[14:33:27] it doesn't care about what the prefix contains
[14:33:29] <_joe_> ok
[14:33:40] <_joe_> so yes what godog said
[14:33:52] <_joe_> I dunno if utf-8 is allowed in puppet class names
[14:33:59] only one way to find out
[14:34:20] hold my unicode, I'm going in
[14:35:49] <_joe_> thankfully it does not
[14:36:58] _joe_: does the mcrouter tox env actually work? it seems to be telling me that `rm` is not installed
[14:37:25] https://phabricator.wikimedia.org/P12590
[14:38:13] kormat: have you added rm to whitelist_externals ?
[14:38:20] that whitelists binaries outside of the venv
[14:38:34] volans: i haven't touched the config at all
[14:38:40] <_joe_> kormat: it worked last time I used it for sure
[14:38:50] <_joe_> it caught a typo, but that was years ago
[14:38:50] _joe_: what year was that?
[14:38:56] <_joe_> like 1 year ago at least
[14:39:12] haha, i see
[14:40:19] it looks like modules/envoyproxy contains a working python test env
[14:40:59] <_joe_> see? I didn't want to point you to my copy of what was done in mcrouter
[15:03:02] marostegui: all power has been restored to c3
[15:03:53] cmjohnson1: thanks, db1090 reports one power supply disconnected, if you can double check? thank you!
[15:05:49] moving on to C4 now, removing power from one side
[15:15:32] two hosts powered down in C4 now it looks like, ms-be1024 and thanos-be1003
[15:16:48] (back now)
[15:20:49] godog both power supplies were plugged into side a for both servers
[15:21:41] heheh that'd explain, thanks cmjohnson1
[15:46:04] marostegui: c4 all power restored...moving to c5
[15:46:26] cmjohnson1: thank you!
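The behaviour kormat describes for the binary-search script (give it an ordered prefix, get back every line that starts with it) could be sketched as below. This is only an in-memory illustration, assuming the lines are already sorted lexicographically; the actual script is not shown in the log and presumably seeks within the file rather than loading it. The name `lines_with_prefix` is hypothetical.

```python
from bisect import bisect_left


def lines_with_prefix(sorted_lines, prefix):
    """Return all lines starting with prefix from a lexicographically sorted list."""
    # Binary search for the first line that sorts at or after the prefix.
    start = bisect_left(sorted_lines, prefix)
    end = start
    # Matching lines are contiguous in sorted input, so scan until the first miss.
    while end < len(sorted_lines) and sorted_lines[end].startswith(prefix):
        end += 1
    return sorted_lines[start:end]
```

For example, with log lines that begin with ISO timestamps, the prefix `2020-09-15T09` would select everything logged in that hour. A single-file helper like this is also the kind of thing that is easy to cover with a handful of unit tests.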
[16:35:19] marostegui: power restored c5
[16:36:44] cmjohnson1: excellent, is there any other rack left?
[16:38:11] done for today
[16:40:12] cmjohnson1: great, thank you! I will restart our mysqls then
[19:14:17] I wish i could tell the puppet compiler to run on 'all hosts using a defined type' but that's not possible like it is with actual classes, or is it?
[20:44:50] you can get the list of hosts from cumin though :)
[20:46:48] volans: good point, yes. thanks
[21:15:14] mutante: "I wish I could tell Puppet" t-shirt idea
[21:23:56] sukhe: ;)
[21:24:41] yeah I've also written puppetdb queries for such, which is useful if you want to see all the params given to all instantiations of a certain class
[21:24:59] (recipe in a paste somewhere)
[21:32:26] cdanis: is that equivalent to "the proof is too long to fit in this margin"? ;)
[21:32:54] I think it's secretly a gripe about Phabricator not having any sort of text search (full-text or title-only) for pastes
[21:39:14] nobody can tell puppet.
[21:42:25] huh. I had never noticed before that the 'query' input in the "advanced search" for pastes is mostly useless. :/
[21:43:22] It seems to a) only search the paste title and b) only return the first matching result
[21:45:40] yeah :(
[22:46:02] cdanis: dunno why i keep telling them about the phab ticket..duh. maybe we need a backup ticket system outside our own infra
[22:46:10] ahah
[22:46:39] I just realized it'd definitely be possible to distribute tunnelencabulator as a Windows app
[22:47:44] :)
[22:48:02] maybe we should use translatewiki to get translations of our "how to report networking issues" page
[22:48:18] let the pros handle it for all the langs
[22:53:30] I think we should work on the process some, before we do that
[22:53:41] I think a bunch there could be simplified
[22:54:12] it's pretty clear we have a bunch of users who look at the page, and have a reaction of "nope"
[22:56:35] true
[22:57:43] I also have some ideas in my head about "there should be a simple executable that gathers this", but that's much harder ofc
[22:58:28] the brand-new security.wikimedia.org website could have a drop form maybe
[22:58:37] well, scratch that. i meant security-static :o
[22:59:01] can somebody quick +1 this https://gerrit.wikimedia.org/r/c/operations/puppet/+/627627
[23:29:21] thanks :)
[23:34:00] mutante: https://phabricator.wikimedia.org/T262613#6464857
[23:37:35] cdanis: wow, seems very cool to use RIPE Atlas probes
[23:37:46] I don't know why I didn't try it earlier
[23:37:53] there were even more than I expected
[23:38:19] I'm not going to mess with it right now, but, tomorrow morning I'll add the stats I can (probably just the ping rtt) to our atlas_exporter, so we have some basic graphs
[23:38:30] cool!
[23:38:30] unfortunately I don't think atlas_exporter groks traceroutes at all though