[06:24:11] volans|off: not sure how to proceed with this https://phabricator.wikimedia.org/P13333 - I guess the DNS changes are from andrewbogott's work, but the exceptions raised above... first time I see those
[06:37:24] marostegui: ah snap are you decomming?
[06:37:30] yes
[06:37:55] Not sure how to proceed after those errors :(
[06:38:11] so the cookbook needs a little adjustment atm (see the error at line 42), we are working on it with Riccardo
[06:38:16] hopefully it should be fixed today
[06:38:39] elukey: So should I consider the host decommissioned, you think?
[06:39:49] marostegui: it is basically, the only thing that might happen is that we'll have to run the decom script again on the same host to clear out inconsistent states (if any) on netbox
[06:40:09] marostegui: but if you could stop decomming (if you have other hosts) it would be good
[06:40:27] elukey: ah cool, so I will then proceed with the rest (merging puppet changes and all that), and I will wait for you guys to let me know if I should run the script again
[06:40:49] elukey: yeah, I have two more hosts to decommission during this week, but I will wait for the green light
[06:40:59] super :)
[06:41:02] elukey: thank you!
[06:41:24] <3
[07:47:08] hello people
[07:47:20] there are two alerts for cumin hosts related to backups
[07:47:39] remote-backup-mariadb timers
[07:47:52] "Backup process completed, but some backups finished with error codes"
[07:49:13] elukey: I am opening a task as we speak
[09:21:46] marostegui: sorry about that, yes there were some issues and we were fixing them with elukey
[09:22:07] volans: sure, no worries, I will wait for the other two hosts :)
[09:22:09] thanks for fixing it!
[09:22:10] the host is powered off anyway so decomm'ed from that PoV
[09:22:18] excellent
[09:23:11] marostegui: I've read riccardo's sentence as "yes there were some issues and one of those was elukey" at first
[09:23:28] and it would have totally made sense anyway
[09:23:34] elukey: which could also make sense
[09:23:34] ahahahah, not true
[09:23:35] haha
[09:23:40] ahahah seee
[09:51:25] jbond42: should I push our commit too?
[09:51:28] I beat you to it
[09:51:44] John Bond: idp: update idp_primary/idp_failover (3ff71fdc20)
[09:52:37] yes please
[09:53:53] only because you said please
[09:53:56] :p
[10:02:25] :)
[10:02:29] thx
[11:35:41] elukey: you around?
[11:36:02] arturo: you need someone to blame?
[11:36:40] nope :-) someone to help me clarify how to send logs to ELK, specifically which network holes I need to open to reach the kafka servers
[11:37:12] arturo: o/
[11:37:17] context: T268176
[11:37:17] T268176: Allow kafka network access to cloudvirt hosts - https://phabricator.wikimedia.org/T268176
[11:37:42] I take it from https://wikitech.wikimedia.org/wiki/Kafka#logging_(eqiad_and_codfw) that we need these ones? https://github.com/wikimedia/puppet/blob/HEAD/hieradata/common.yaml#L617-L628
[11:38:29] * elukey drops all wikilove saved for kormat
[11:38:29] or perhaps this should be directed to godog's team?
[11:39:04] ah yes in this case I think so, that kafka cluster is not managed by my team
[11:39:20] ok, sorry for the noise elukey
[11:39:44] arturo: np!
[11:41:28] arturo: I'll reply on the task, but tl;dr all hosts in the "logging-eqiad" kafka cluster
[11:41:41] marostegui: can you retry the decom cookbook when you have time? Riccardo fixed it, and it works for me :)
[11:41:48] sure, for the same host?
[11:41:56] (should do the cleanup needed in netbox without removing any other dns etc..)
[11:41:58] expect a failure for the wipe bootloader step, that was already done earlier
[11:41:59] yep yep
[11:42:06] ok, trying!
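For reference, a re-run of the decommission cookbook on the same host would look roughly like this minimal sketch; the hostname and task ID are made up, and the exact cookbook name and flags are an assumption about the standard spicerack setup rather than something copied from this conversation:

```bash
# Hypothetical re-run against the already powered-off host: the Netbox cleanup
# is expected to be idempotent, and the bootloader-wipe step is expected to
# fail since it already ran during the first attempt.
sudo cookbook sre.hosts.decommission db1234.eqiad.wmnet -t T123456
```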
[11:42:06] godog: is there a single service address or something?
[11:42:56] (/me lunch)
[11:43:00] arturo: no, kafka's happier if it has the list of all brokers afaik
[11:43:40] ack
[13:24:34] jbond42: i have angered the pcc gods: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26536/console
[13:25:30] is it verboten to have multiple roles listed as targets?
[13:25:38] kormat: yes unfortunately you can only pass on Hosts override.
[13:25:45] s/on/one/
[13:26:06] huh. but multiple hosts work?
[13:26:20] of course it should be handled better
[13:26:53] yes multiple hosts work, and you can only use one of re:/some regex/ | {O,C,P}:
[13:27:17] ahh, i see. 🥀
[13:37:09] jbond42: having a manual pcc run update the CR with a notification is A+. nice :)
[13:38:45] :) thanks
[13:55:38] godog: there's something odd with thanos and idp (i think)
[13:55:59] i have a thanos page open (https://thanos.wikimedia.org/graph?g0.range_input=2d&g0.max_source_resolution=0s&g0.expr=sum_over_time(mysql_exporter_last_scrape_error%5B5m%5D)%20%3E%201&g0.tab=1 for the sake of completeness)
[13:56:13] when i open it, the query works fine
[13:56:26] but if i come back to that tab a few hours later, i get `Error executing query: error`
[13:57:02] kormat: anything in the console log?
[13:57:05] the js log indicates a 403 from idp
[13:57:24] reloading the thanos page fixes it for another few hours
[13:58:00] is there anything more than just the 403? i'm wondering if it's similar to https://phabricator.wikimedia.org/T267186
[13:58:57] mmhh yeah I'm guessing the xhr from the thanos ui barfs on the idp session renewal, or sth like that
[13:59:16] HTTP/1.1 403 Forbidden
[13:59:21] vary: Origin,Access-Control-Request-Method,Access-Control-Request-Headers
[13:59:57] hmm, that does look a bit CORS-like
[14:00:06] any errors in the browser's console ?
[14:00:20] godog: that's where i'm getting this from
[14:00:20] I'm leaving a thanos page open too in the meantime
[14:00:39] kormat: ah! nevermind
[14:01:11] there's also a lot of `Unchecked lastError value: Error: Could not establish connection. Receiving end does not exist.`
[14:01:46] but that refers to an extension, so i'm not sure
[14:03:07] yeah now I'm wondering if jbond42's latest fixes in T267186 fixed the thanos ui issue as well
[14:03:07] T267186: alerts.w.o / idp.w.o interaction and CORS - https://phabricator.wikimedia.org/T267186
[14:03:27] (the extension is bitwarden, so probably unrelated)
[14:03:33] jbond42: what's the idp session lifetime ?
[14:03:58] godog: the updates i added should be pinned to requests that have an Origin: https://alerts.wikimedia.org so it shouldn't affect thanos
[14:04:55] godog: https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration#Session_timeout_handling
[14:06:01] jbond42: thank you! reading
[14:08:35] mmhh can't find the session lifetime duration at that link, I'm wondering how long before the renewal redirect is supposed to happen
[14:09:10] godog: i think it's an RTFM that says RTFC(ode) or something :)
[14:09:44] * jbond42 looking for the detail
[14:12:20] kormat: hahah I'm getting to the same conclusion
[14:14:18] ok i have added https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration#Current_Settings to help a bit
[14:15:16] The other bit of the equation is application-specific, i.e. thanos in this case creates its own session which has additional timeouts
[14:15:31] thanos uses mod_auth_cas so the timeouts in play there are related to that
[14:16:45] i can't see us setting anything specific, which means they will take the default values; for `CASTimeout: 2H` the hard timeout means the session will always expire at that point, but as long as the cas session is still valid it should be auto-renewed
[14:16:55] there is also CASIdleTimeout=1h
[14:17:25] when these timeouts are hit the app will either do a 302 redirect to cas and the session will get automatically renewed
[14:17:38] or cors kicks in and renews the session
[14:17:53] this latter bit of functionality needs some additional config
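As an illustration of the two outcomes described above, one could poke the same API the Thanos UI uses with curl; this is only a sketch, assuming a cookie jar saved from an earlier authenticated browser session, and the query itself is just a placeholder:

```bash
# Hit the Prometheus-compatible API behind the Thanos UI with a saved (possibly
# stale) session cookie and look only at the status code.
curl -s -o /dev/null -w '%{http_code}\n' \
  -b cookies.txt \
  'https://thanos.wikimedia.org/api/v1/query?query=up'
# 302 -> the normal renewal redirect to idp.wikimedia.org
# 403 -> the failure mode above, where the UI's XHR cannot complete the renewal
#        and shows "Error executing query: error" until the page is reloaded
```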
[14:18:06] btw which browser do you both use?
[14:19:28] moritzm: may be able to add anything i may have missed
[14:19:30] i'm on firefox
[14:20:00] i can also open it in chrome, to have a second test-case
[14:20:05] ack thanks, shouldn't make a difference, was just curious
[14:20:46] jbond42: thanks for the edit! I'm on chrome
[14:21:28] ack thanks
[14:26:51] needs some deeper digging, let's open a task?
[14:33:14] sure
[14:36:03] moritzm: what parameters in an idp url are sensitive?
[14:37:55] mm, looks like none
[14:38:09] yeah, in the url there's nothing really sensitive
[14:38:35] https://phabricator.wikimedia.org/T268233 created with what i have
[14:39:21] thx, will have a closer look tomorrow
[15:29:08] while I was looking at something else, I noticed:
[15:29:20] https://wikitech.wikimedia.org/wiki/Monitoring/dpkg -> dpkg -l | grep -v "^ii"
[15:29:38] this currently shows:
[15:29:39] rc apt-listchanges 3.19 all package change history notification tool
[15:29:51] on the hosts I've tried it on, is this something that needs fixing?
[15:32:34] no, that's intentional, Debian installs apt-listchanges by default to show important changes (those shipped as NEWS), but we don't really need it and it can lead to cron spam, so it was removed back then: https://github.com/wikimedia/puppet/commit/18ec9248
[15:33:22] yeah but why does it show a weird package state?
[15:34:05] it seems to ship a conffile, so it doesn't get uninstalled completely (to /etc/apt/conf.d, but that is a directory which is managed by Puppet with recurse/purge)
[15:34:23] I'll make a patch to move it to the list of purged packages
[15:35:01] since dpkg thinks the file is present, but Puppet yanked it as it's not known to the catalogue)
[15:41:29] https://gerrit.wikimedia.org/r/c/operations/puppet/+/642019/ should fix it
[15:43:35] thanks!
[15:58:35] Hi SRE folks. I have some +1
[15:59:15] I have some +1'd mods to logspam/logspam-watch which need merging in operations/puppet.
[15:59:16] https://gerrit.wikimedia.org/r/c/operations/puppet/+/641804
[15:59:16] https://gerrit.wikimedia.org/r/c/operations/puppet/+/641825
[15:59:16] https://gerrit.wikimedia.org/r/c/operations/puppet/+/641831
[15:59:16] Can someone +2 them for me ?
[16:00:55] dancy: hey, happy to look in just a few minutes
[16:01:05] Thanks!
[16:03:12] dancy: these are ready to merge, right? I'm not going to code review your perl, just stamping them for "doesn't break the puppet repo and/or compromise production" :)
[16:03:35] Yep, ready to go.
[16:03:51] 👍
[16:05:09] Thank you sir!
[16:06:17] thanks for the "rc apt-listchanges" fix, I was wondering about that too the other day
[16:06:30] sure thing! and all merged at the puppetmaster
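Picking up the apt-listchanges thread, a small illustration of that "rc" state and how purging clears it; running this by hand is only a sketch, since the actual fix above goes through Puppet's purged-packages list:

```bash
# "rc" means the package was removed but its conffiles are still registered.
dpkg -l | awk '/^rc/ {print $2}'     # list packages stuck in the removed/conffiles-remain state
sudo dpkg --purge apt-listchanges    # purging drops the leftover conffile entries
```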
[16:09:43] kormat: are you around?
[16:10:10] or marostegui ?
[16:12:59] effie: what's up?
[16:13:51] so rzl and I want to run a long-running script again
[16:14:17] and we want to somehow ensure that the db server we will be talking to won't be rebooted or depooled
[16:15:03] but we will know after we start the script :D
[16:15:13] kormat: can you help us ?
[16:15:27] how long, and how lucky do you feel?
[16:16:07] 1 whole week and maybe 1-2 more days
[16:16:18] we know it has happened before, so it is possible
[16:17:18] effie take one of the vslow hosts I would say
[16:17:34] it is not guaranteed of course, but we'll try!
[16:17:36] marostegui I am not sure we can choose
[16:17:54] right...
[16:18:54] effie: there's nothing automated that will reboot/depool a server, it's just us 2 running various maintenance things
[16:18:56] we might be able to, but we'd have to get someone who knows more PHP than us to have a look at the script and see :D
[16:23:18] can you tell which db server you're talking to?
[16:24:07] we can after we start talking to it
[16:24:15] not sure, maybe -- in this case we could tell it was db1106 because the script choked and died when it shut down
[16:24:30] (which, the script should obviously be smarter than that, and that is obviously not your fault)
[16:25:50] ok, well i guess the best we can offer at the moment is you tell us which db instance you end up talking to, and we'll try to make sure we don't do anything naughty with it
[16:25:50] but yeah if nothing else we should be able to see what sockets it has open, or whatever
[16:26:25] (it's mostly manuel tbh. i try not to harass the machines too much)
[16:26:56] okay, much appreciated -- we promise not to run this upgrade this way ever again
[16:27:01] which was already the plan, because it sucks
[16:27:15] but *this* time we're *actually* never gonna do it again, unlike the last time we said that
[16:27:19] what we normally do with long-term maintenance
[16:27:46] is to ask for it to run in small batches and keep a counter (e.g. that was how the wikidata migration was done)
[16:28:01] that way if it fails or a db has to go down, no work is lost
[16:28:10] just FYI
[16:28:26] yeah, unfortunately in this case that's tricky -- you can't actually tell by inspecting the table which rows have been hit yet
[16:28:41] full context in T264991 if you're interested, and in the long run we'd like to rearchitect things so that isn't true
[16:28:42] T264991: Upgrade the MediaWiki servers to ICU 63 - https://phabricator.wikimedia.org/T264991
[16:28:56] yeah, mentioning it for the future
[16:29:09] so we know 80390200 rows had been processed when the script died, but we can't use that information to skip them next time :/
[16:29:11] e.g. keeping a table counter
[16:29:26] at least I think we can't
[16:29:27] like a PK or username, title etc.
[16:29:43] yeah, again just mentioning it for reach that you mentioned
[16:29:47] *rearch
[16:29:50] yeah appreciate it -- it's already on the roadmap, there's a lot of things about this that shouldn't work the way they do
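For illustration, the batch-plus-checkpoint pattern described above might look like the skeleton below; process_batch is a stand-in for whatever command does the real work, and the checkpoint path and batch size are made up:

```bash
# Hypothetical skeleton: work in bounded batches and record a high-water mark
# (e.g. the last primary key handled) after each one, so a crash or a DB going
# down only loses the current batch instead of a week of progress.
checkpoint=/var/tmp/reindex.checkpoint
last_id=$(cat "$checkpoint" 2>/dev/null || echo 0)

while next_id=$(process_batch --start-after "$last_id" --limit 1000); do
  [ -z "$next_id" ] && break          # nothing left to process
  last_id=$next_id
  echo "$last_id" > "$checkpoint"     # a re-run resumes from here
  sleep 1                             # stay gentle on the database
done
```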
[16:36:35] marostegui, kormat: db1118 is the lucky winner
[16:37:34] rzl: alright :)
[16:37:52] also 1083 obvs, I have no idea how this thing would handle a master failover but probably not gracefully
[16:37:59] we expect a fort to be built around this host
[16:38:27] rzl: we don't have any imminent master failovers, at least
[16:41:43] not any known ones, at least ;)
[16:41:53] quite :)
[16:45:15] you are jinxing it
[16:45:47] no, this is jinxing it: it's been so long since we had an unexpected s1 master failover, I'm sure it won't happen this week
[16:46:00] if you're gonna jinx it, really go all-out
[16:46:58] "it's been too long since we've had a massive failure of the entire Juniper virtual chassis system"
[16:47:19] kormat: and, thanks :) appreciate you working around us here, even though you shouldn't have to
[17:16:42] rzl: One more logspam improvement just came through. When you have a moment please +2 https://gerrit.wikimedia.org/r/c/operations/puppet/+/642035 Already tested by me.
[17:17:27] dancy: already stamped and merged it :)
[17:17:42] Fantastic. Thanks again!
[17:17:48] no worries!
[17:51:23] qq - is the logstash event that just happened something that was expected/discussed?
[17:51:33] https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&refresh=5m&from=now-1h&to=now shows a big jump in events
[17:51:42] herron: --^ (hi!)
[17:53:47] hmm, I'm looking at the logstash patch that was submitted right before it
[17:54:06] wondering if puppet bounced things at around the same time, usually we roll those out slowly
[19:56:28] One small use case for the old DNS repo... i used to also check in there what is a VM and what is not.. and how many VMs are in each row, for example how many are in the new row D and stuff like that. that was easy with grep and vi in the DNS repo before netbox generated it. That is kind of missing now.. so I went directly to a DNS server to look at the generated results there, but it does not have the
[19:56:34] "VM on ganeti.." comments next to hosts anymore. In netbox I can look at the networks/IPs but the description field does not include whether it's virtual.
[19:58:47] interesting -- what is the thing that you want to check for?
[20:00:32] or do in general :)
[20:02:16] paravoid: I wanted to see how many VMs are in each row/network to say something like "please put the 5 new VMs into the new row D, which is much emptier than the others"
[20:03:03] in the past I would have looked in 10.in-addr.arpa at those special comments
[20:03:26] using gnt-instance list or gnt-node list it is not immediately obvious which row they are in
[20:04:32] maybe the "description" column on something like https://netbox.wikimedia.org/ipam/prefixes/101/ip-addresses/ could have "virtual"
[20:06:45] "gnt-instance list --output name,pnode.group" should do it
[20:07:04] but yeah, we don't have node groups in netbox, that's a bit separate
[20:07:17] or `-o +pnode.group` if you want the default fields plus that one
[20:07:18] we could import those as well, not sure if it'd be worth it
[20:07:56] that works! thank you :)
[20:08:47] a complication of course is that the group is a property of the node, not the VM, so you'd still not get that view from netbox
[20:11:10] I'm good with this: for row in A B C D; do echo -n "$row: "; sudo gnt-instance list --output name,pnode.group | grep row_${row} | wc -l; done
[20:11:26] so B and D are the empty ones
[20:11:53] no need for netbox then for this one
[20:13:43] "gnt-instance list --output pnode.group | sort | uniq -c" :P
[20:14:04] I was about to paste the same thing but with --no-headers on the gnt-instance invocation :)
[20:14:37] :)
[21:41:36] mutante: you can simply run "sudo gnt-group list", it will show you all groups/rows along with the number of instances in each