[01:04:49] hi
[01:09:10] chasemp: andrewbogott nothing else blew up in 2 hours!
[01:09:13] wooo
[01:38:41] 6Labs, 10Labs-Infrastructure, 6operations, 7Icinga, 5Patch-For-Review: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#1886694 (10Dzahn) a:3Dzahn
[02:51:40] YuviPanda: woooooooo!
[02:54:37] andrewbogott: another two hours have gone by and nothing bad has happened!
[04:58:18] is extension1 available on Tool Labs?
[05:01:20] 6Labs, 10Tool-Labs: Set up replica from extension1 cluster - https://phabricator.wikimedia.org/T121742#1886832 (10liangent) 3NEW
[08:57:27] can anyone access files under "tools.mavrikant@tools-bastion-01:~$" ?
[08:59:28] I am the only member of this tool.
[09:41:37] Mavrikant: by default, yes.
[09:42:11] what if I use chmod 700?
[09:42:33] Mavrikant: then you can block other people, but why do you want to do that?
[09:42:46] keep in mind you are required to use an open source license for the tools you run
[09:42:53] valhallasw`cloud, password
[09:44:06] Mavrikant: right. What most people do is that they have a 'password file' or configuration file which is chmod 600 or 660 (r/w for owner or owner+group)
[09:45:31] but if the password is in a script, you can also chmod that script
[09:45:38] valhallasw`cloud: group, tool-members right?
[09:45:50] group are the other members of the tool, yes
[09:46:01] the owner is the user 'tools.mavrikant'
[09:46:59] ok. I get it. I am new to Ubuntu.
[09:58:29] Hi! Sometimes I get this error in the logs: "libgcc_s.so.1 must be installed for pthread_cancel to work". It looks like some nodes on the grid are missing some libs. Am I right? How do I get this fixed?
[09:59:13] leloiandudu: increase the amount of memory you request for your job
[10:00:34] Hrmmm... ok then. Thanks
[10:01:29] I'm in doughts because this looks like my tool failed (which is perfectly ok sometimes) and it just cannot produce a managed stack trace
[10:01:40] *doubts
[10:08:58] 6Labs, 10Tool-Labs: Set up replica from extension1 cluster - https://phabricator.wikimedia.org/T121742#1887125 (10jcrespo) @liangent which specific data (columns) do you need? I do not think it is possible to replicate everything, but for a specific column(s) it could be studied. Please note that this requires...
[10:57:28] 6Labs, 10Tool-Labs: Set up replica from extension1 cluster - https://phabricator.wikimedia.org/T121742#1887195 (10liangent) >>! In T121742#1887125, @jcrespo wrote: > @liangent which specific data (columns) do you need? I do not think it is possible to replicate everything, but for a specific column(s) it could...
[11:09:38] 6Labs, 10Tool-Labs: Set up replica from extension1 cluster - https://phabricator.wikimedia.org/T121742#1887217 (10jcrespo) You seem to want Flow-related content. That is doable, technically (because we have the space and performance for that specific component). It requires review by Flow developers, telling...
[11:17:59] 6Labs, 10Tool-Labs: Set up replica from extension1 cluster - https://phabricator.wikimedia.org/T121742#1887221 (10liangent) Yes I'm hitting Flow content right now ... but since I'm using extensions "randomly" and I only look at their public interfaces I'm using, I'm not sure if they also try to talk to extensi...
[11:44:51] 6Labs, 10Tool-Labs: Set up replica from extension1 cluster - https://phabricator.wikimedia.org/T121742#1887268 (10jcrespo) There are other extensions that store data on x1 (check our configuration files). Not all of those are viable to be replicated to the labs replicas due to load concerns (they cannot be on...
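For reference, the file-permissions advice in the exchange above (around 09:44) amounts to something like the following minimal sketch. It is run as the tool account on a Tool Labs bastion; the file and script names are illustrative, not taken from the log.

```bash
# Sketch only; run as the tool account (e.g. tools.mavrikant).
touch ~/api_password            # hypothetical credentials file
chmod 600 ~/api_password        # readable/writable by the tool account only
# or, to also allow the tool's other maintainers (the tool's group):
chmod 660 ~/api_password
# If a password is embedded directly in a script, the script itself can be
# restricted instead, at the cost of hiding otherwise open-source code:
chmod 700 ~/my_bot.sh           # hypothetical script name
```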
[11:55:57] 6Labs, 10Tool-Labs: Set up replica from extension1 cluster - https://phabricator.wikimedia.org/T121742#1887298 (10liangent) I see Echo, CX and Flow there. Since I'm mostly working on content, hitting other items seems not so likely as they're more about user activities. However it's difficult for me to "predic...
[12:41:21] 6Labs, 10Attribution-Generator, 6TCB-Team, 6WMF-Legal, 5Attribution-Generator-Release-2.0: [AG] [Task] Assign 1 IP address to lizenzhinweisgenerator labs project - https://phabricator.wikimedia.org/T121095#1887351 (10Addshore)
[12:41:52] 6Labs, 10Attribution-Generator, 6TCB-Team, 6WMF-Legal, 5Attribution-Generator-Release-2.0: [AG] [Task] Assign 1 IP address to lizenzhinweisgenerator labs project - https://phabricator.wikimedia.org/T121095#1868930 (10Addshore) @yuvipanda, it looks like this is unblocked now :)
[13:33:39] (03CR) 10Aklapper: "@Multichill: Is this still relevant / something to update? This patch has been sitting here for five months without review, hence asking. " [labs/tools/multichill] - 10https://gerrit.wikimedia.org/r/223671 (owner: 10Multichill)
[13:52:01] andrewbogott: I guess moving here is sane so yuvi can see but in essence for posterity: we need to stop using salt as a kind of global run mechanism on labs because it's nowhere near
[13:52:12] and it's wildly inconsistent even when it's sort of sometimes right
[13:52:36] and it has some design limitations afa butting heads with projects on their own salt masters
[13:52:46] and multiple master scenarios are basically impossible
[13:54:15] chasemp: yeah, I agree… we don't have a good replacement yet do we?
[13:54:23] (Yuvi said something about using a different tool but I didn't follow.)
[13:54:40] I have 2 ideas in mind but am basically looking to get consensus among us three at least
[13:54:44] that we need to move on
[13:54:59] so yesterday I walked through a bunch of consistency exercises
[13:55:38] VM count for deleted_at is null from m5 was 698
[13:55:55] salt all instances (not even responding ones necessarily) is 578
[13:56:11] salt instances it thinks respond is 530
[13:56:34] sort of
[13:56:41] so salt has baked-in mechanisms for determining valid minions
[13:56:54] and I ran through a lot of how that works and what the deal is but this is demonstrative
[13:56:56] 526 526 20187 eight.txt
[13:56:58] 530 530 20349 five.txt
[13:56:58] 530 530 20349 four.txt
[13:56:58] 520 520 19987 nine.txt
[13:57:00] 511 511 19657 one.txt
[13:57:02] 530 530 20349 seven.txt
[13:57:05] 530 530 20349 six.txt
[13:57:06] 531 531 20391 ten.txt
[13:57:08] 530 530 20349 three.txt
[13:57:11] 516 516 19832 two.txt
[13:57:20] that is from the built-in salt mechanism for determining how many minions exist (manage.up not test.ping)
[13:57:36] so even if the 530 was even close to the neighborhood, it's wildly inconsistent with host communications
[13:57:56] I talked to ariel a lot about this
[13:58:09] and our master hasn't been getting any of the patches as prod has (I guess)
[13:58:12] I don't think you need to convince anyone that salt is unreliable about collecting minions.
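For reference, the consistency exercise described above can be reproduced with something like the following sketch, run on the labs salt master. Only `salt-run manage.up` and `salt '*' test.ping` are standard Salt commands; the output file names and the database query are illustrative assumptions.

```bash
#!/bin/bash
# Minimal sketch, assuming it runs on the labs salt master.
set -u

# Repeatedly ask salt which minions it believes are up; on a healthy master
# these counts should barely move between runs.
for run in one two three four five six seven eight nine ten; do
    salt-run manage.up > "${run}.txt" 2>/dev/null
done
wc *.txt

# Compare against minions that actually answer a ping right now.
salt '*' test.ping --out=txt 2>/dev/null | grep -c ': True'

# And against the authoritative instance list in the nova database
# (hypothetical connection details, shown only for comparison):
# mysql -h m5-master nova -e "SELECT COUNT(*) FROM instances WHERE deleted_at IS NULL;"
```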
[13:58:36] sure, just demonstrating it as more than conjecture and "group knowledge"
[13:59:06] https://dpaste.de/hQya
[13:59:12] something I collected a month ago
[13:59:41] hey guys
[13:59:42] ok so, when I say it's worse than useless I mean if you run a salt command you don't know at all what happened
[13:59:45] as it's not even wrong consistently
[13:59:52] yeah
[13:59:56] we've had a number of different labs issues lately but we haven't followed up with documentation about them
[14:00:26] a lot of it is in emails but still not proper incident reports
[14:01:01] true, which events qualify for incident reports do you think? the labstore failure definitely, ldap crash friday night?
[14:01:11] paravoid: yeah, I need to write an incident report about the pam stuff. I'm hoping someone closer in can write the ldap-perf and nfs-failure reports
[14:02:07] yeah, NFS outage + PAM stuff is what I was thinking
[14:02:27] but the LDAP one could be another one yeah
[14:03:04] and we've also had the tools.wmflabs.org outage due to fa.m.wikinews, which I'm not sure qualifies at this point
[14:04:27] I can start the PAM one, but you may have the best context for both the ldap crash and labstore, paravoid
[14:05:08] I can do the ldap crash
[14:05:21] for the labstore one... I know what happened from the middle onwards but I have no idea how it started
[14:05:34] regarding the wikinews issue, I'd think not. But I don't have all of the different performance issues untangled in my mind. Some were wikinews, some were ldap performance, all mixed together into a great wash of 'sometimes things are slow'
[14:05:44] YuviPanda: was there for the labstore thing start to finish
[14:06:24] as for the wikinews issue: that took down the front page, but not any tools. It's something we could expect to happen again on a larger scale, though (say, enwiki loading javascript -- that definitely happened before). I filed https://phabricator.wikimedia.org/T121233 on monitoring, but I don't think a full incident report is necessarily useful
[14:06:34] ok YuviPanda ping: labstore crash incident report when you awake puhlease
[14:06:45] valhallasw`cloud: nod, thanks
[14:06:49] I'll give him a heads up when it's a sane time on the west coast
[14:07:30] I can't find any docs on the previous issue, though :/
[14:10:40] 6Labs, 10Tool-Labs: Implement metrics for tool labs (under NDA?) - https://phabricator.wikimedia.org/T121233#1887481 (10valhallasw) For reference, this happened earlier with wdsearch on enwiki: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-tech/20131204.txt / http://bots.wmflabs.org/~wm-bot/logs/%23wikimed...
[14:12:13] so andrewbogott what I mean is when I see the PAM rollout calling on salt to run cleanup and for deployment, or LDAP configuration that is stale on hosts which we discover are dark to us, or the "we can flip the dns switch but not really ensure even what VMs have what DNS" issue, on its best day salt is seeing 75% of VMs
[14:12:17] and that only goes down
[14:12:24] as it does its thing
[14:12:39] I'm not proposing a solution, I'm more looking for consensus on pursuing one
[14:12:49] I have more to say on the "salt will be better" front
[14:12:53] after talking to ariel as well
[14:33:48] yeah, I think at the last Ops off-site we agreed that salt was crap and needed replacing
[14:33:57] Or, that's what I remember at least, maybe that was wishful thinking
[14:44:38] ok, I was a bit more fuzzy on it, I thought we were waiting on newer versions with some idea salt would eventually be good and we should stick it out
[14:45:01] I would like to say, salt will never solve this problem in labs so let's stop pretending it does as it's doing more harm than good and remove it :)
[14:46:33] 6Labs, 10Tool-Labs: Web nodes complain "su: Permission denied" - https://phabricator.wikimedia.org/T121765#1887563 (10scfc) 3NEW
[14:50:33] paravoid: any thoughts about ^ ?
[15:03:39] andrewbogott: https://gerrit.wikimedia.org/r/259705
[15:03:51] probably....
[15:30:05] 6Labs, 10Tool-Labs, 5Patch-For-Review: Web nodes complain "su: Permission denied" - https://phabricator.wikimedia.org/T121765#1887644 (10Andrew) @scfc, the above patch ought to have resolved the issue. Can you confirm?
[15:54:44] chasemp, andrewbogott: let
[15:54:53] chasemp, andrewbogott: let's stop the opendj servers now?
[15:55:03] kk
[15:55:55] ok!
[15:56:03] going ahead
[16:04:37] chasemp, andrewbogott: weird, the puppet run didn't stop opendj for some reason, I've stopped them manually (and double-checked that another puppet run doesn't start it either)
[16:05:08] huh well good riddance to bad rubbish
[16:06:42] huh
[16:09:23] opendj is stopped on both and I've also silenced the icinga checks, we can keep it as-is and if nothing further comes up, decom/repurpose the servers next week
[16:09:46] 6Labs, 7Shinken: Labs Shinken complains about no more existing host integration-t102459 is DOWN - https://phabricator.wikimedia.org/T121767#1887675 (10hashar) 3NEW
[16:11:45] 6Labs, 10Tool-Labs, 5Patch-For-Review: Web nodes complain "su: Permission denied" - https://phabricator.wikimedia.org/T121765#1887688 (10scfc) The daily cron scripts run around 7:00Z, so I'll confirm and close this task tomorrow morning.
[16:13:23] 6Labs, 7Shinken: Labs Shinken complains about no more existing host integration-t102459 is DOWN - https://phabricator.wikimedia.org/T121767#1887692 (10Krenair) Is this basically the same thing as T111540 ?
[16:19:21] moritzm: are you creating tickets about decomming neptunium and nembus?
[16:21:25] andrewbogott: opening up the firewall on stashbot-deploy doesn't seem to have allowed Puppet to fix the ssh issues on the other hosts in that project.
[16:21:27] I'll do that, but let's wait until next week (just for the very unlikely case that we do need to re-enable opendj in the next days)
[16:23:43] yep
[16:24:00] bd808: I'll have a look
[16:25:24] 6Labs, 7Shinken: Labs Shinken complains about no more existing host integration-t102459 is DOWN - https://phabricator.wikimedia.org/T121767#1887716 (10hashar) Yeah they are related to garbage collection when a labs instance is deleted. I am assuming the Shinken conf is regenerated daily though and `integration...
[16:32:14] bd808: on stashbot-elastic01 at least, puppet is broken
[16:32:21] "Could not find data item logstash::cluster_hosts in any Hiera data file"
[16:33:38] hmmm... that should be in the wikitech hiera data I think, but I'll look
[16:35:54] yeah, same failure on all three clients
[16:36:13] andrewbogott: cool. I'll fix that
[16:36:49] (looks like that has been broken since October)
[16:37:04] bd808: also on stashbot-deploy it starts redis on every run. I don't know if that means redis is crashing or just that the puppet config is stupid
[16:37:12] but if you have redis issues that's a good place to start
[16:41:12] andrewbogott: thanks for the debugging help. Obviously I should be doing a better job of keeping track of puppet agent status
[16:41:58] bd808: I think it's a systemic issue. Basically puppet failures should automatically nag project admins.
[16:42:57] that would be nice. I had some stuff in deployment-prep at one point (might still be there?) that pushed puppet run reports into Logstash. That was helpful when I was staying on top of that project
[16:43:39] 6Labs: Any puppet failure on a labs instance should send an email to project admins - https://phabricator.wikimedia.org/T121773#1887766 (10Andrew) 3NEW
[17:14:47] bd808: got your logins back?
[17:15:29] andrewbogott: checking...
[17:16:07] \0/ yes. thanks
[17:46:58] andrewbogott: chasemp paravoid (popping in before breakfast) I was there the entire time, but my actions during the setup were just 'omg we should call mark!' at the start and 'ok, start-nfs!' at the end
[17:47:25] can probably write a skeleton, but I think anything by non-mark/paravoid combo is going to be incomplete
[17:47:35] YuviPanda: well, that will make for a nice template :) Then we can drop it on mark.
[17:47:49] indeed
[17:48:09] now I'll run off to eat food and go to the office in an attempt to unfuck my sleep cycle
[18:22:45] 6Labs: Any puppet failure on a labs instance should send an email to project admins - https://phabricator.wikimedia.org/T121773#1888218 (10Andrew)
[19:40:25] YuviPanda: hit me up when you surface at the office
[20:15:39] andrewbogott: in case you were remotely interested, I fixed the redis issue on stashbot-deploy. There was a running redis instance that systemd had lost track of so each time Puppet tried to restart it died because the port was in use.
[20:16:04] huh. Well, good :)
[20:31:34] YuviPanda: it seems toolschecker does not check for jobs in Error state :/
[20:32:41] valhallasw`cloud: yeah :(
[20:32:47] valhallasw`cloud: I thought I cleared error state for all queues?
[20:32:56] YuviPanda: jobs can also be in error state
[20:32:59] chasemp: am here for about 20min before I head to the office. wassup?
[20:33:09] valhallasw`cloud: oh, I thought that was just inherited from queues.
[20:33:16] should I clear 'em? did you clear 'em?
[20:34:02] YuviPanda: I just did for crosswatch
[20:34:14] uh, is wikibugs dead
[20:35:33] oh, crosswatch is just not being reported here
[20:35:55] YuviPanda: https://phabricator.wikimedia.org/T115261#1888551
[20:36:30] YuviPanda: was going to harangue you for the tools talk
[20:36:37] maybe tomorrow would be better or friday tho
[20:37:13] chasemp: yeah, probably not today. I've a 2h meeting later today for fixing the underlying issue I bandaided during the ORES outage
[20:37:20] chasemp: put it on my calendar tomorrow maybe?
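For reference, the "puppet failures should automatically nag project admins" idea filed as T121773 above could look something like the following sketch run on each instance. The summary path and YAML layout assume puppet 3's defaults on Debian/Ubuntu, and the notification step is purely illustrative; none of this is an existing labs mechanism.

```bash
#!/bin/bash
# Sketch: flag a failed or stale puppet agent on a labs instance so project
# admins could be notified. Assumes puppet 3's default state directory.
set -u
SUMMARY=/var/lib/puppet/state/last_run_summary.yaml

if [ ! -r "$SUMMARY" ]; then
    echo "CRITICAL: no puppet run summary found on $(hostname)"
    exit 2
fi

# Non-zero 'failed'/'failed_to_restart' counts, or a run older than ~24h,
# both indicate the instance needs attention.
failures=$(awk '/^ *(failed|failed_to_restart):/ { sum += $2 } END { print sum + 0 }' "$SUMMARY")
last_run=$(awk '/^ *last_run:/ { print $2 }' "$SUMMARY")
age=$(( $(date +%s) - ${last_run:-0} ))

if [ "$failures" -gt 0 ] || [ "$age" -gt 86400 ]; then
    echo "WARNING: puppet on $(hostname): ${failures} failed events, last run ${age}s ago"
    # e.g. mail the project admins here (hypothetical address):
    # echo "puppet failing on $(hostname)" | mail -s "puppet failure" project-admins@example.org
    exit 1
fi
echo "OK: puppet healthy on $(hostname)"
```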
[20:37:27] yes will do
[20:37:32] valhallasw`cloud: I see a bunch of them with E
[20:37:37] I'm going to -cj them all
[20:37:39] wait
[20:38:04] ok
[20:38:07] waiting
[20:38:32] at qstat -u '*' | grep E | tail -11 | awk ' { print $1} ' | xargs -L1 echo
[20:38:39] except instead of echo I'd do qmod -cj
[20:38:56] yeah I just want to check the error reason
[20:39:01] also let's make a task for this
[20:39:15] to check for tasks in E? yeah
[20:39:23] shouldn't be too hard to add to toolschecker
[20:39:51] 6Labs, 10Tool-Labs: Clear job error states after NFS outage - https://phabricator.wikimedia.org/T121798#1888568 (10valhallasw) 3NEW
[20:39:56] no for the actual clearing
[20:40:00] ah
[20:40:02] ok
[20:40:26] andrewbogott: is the staging project all gone now?
[20:40:42] YuviPanda: I think most but not all instances were deleted.
[20:40:47] I haven't looked though.
[20:40:51] that's a good reduction in the number of hosts with role::puppet::self applied :D
[20:41:18] and once we migrate bd808's stashbot to tools over the next few months that'll give us some more
[20:41:27] 6Labs, 10Tool-Labs: Clear job error states after NFS outage - https://phabricator.wikimedia.org/T121798#1888578 (10valhallasw) ``` $ qstat -u "*" | grep "tools.* E" | sed -e 's/ 0\..*//' > jobids root@tools-bastion-01:~# for i in `cat jobids`; do qstat -j $i | grep -e "error reason"; done error reason 1:...
[20:41:28] to clarify - per-project puppetmasters that are using role::puppet::self
[20:41:32] YuviPanda: ^ only half of them are NFS related
[20:42:02] valhallasw`cloud: I think the other half are stuck with that error message from an *LDAP* outage
[20:42:03] YuviPanda: s/months/weeks/
[20:42:06] bd808: \o/
[20:42:12] bd808: do you want to do it the regular way or in k8s?
[20:42:36] k8s still has rough edges, but I think it's definitely way better than trebuchet ;D
[20:42:48] * YuviPanda continues keeping an eye on deis for the PaaS stuff
[20:42:59] YuviPanda: right, so clearing should be ok
[20:43:12] valhallasw`cloud: ok, shall I do so then?
[20:43:18] YuviPanda: sure!
[20:43:30] please copy the command you used to the task and close it ;-)
[20:43:32] YuviPanda: the frontends are already on tools. The part I need to replace is the role that Logstash plays now. I have a bot written but not tested yet.
[20:43:51] It might be nice to put that bot on k8s so it doesn't croak during nfs events
[20:44:04] because having a dead SAL is never fun
[20:44:08] 6Labs, 10Tool-Labs: Clear job error states after NFS outage - https://phabricator.wikimedia.org/T121798#1888580 (10yuvipanda) 5Open>3Resolved a:3yuvipanda ```yuvipanda@tools-bastion-01:~$ qstat -u '*' | grep E | tail -11 | awk ' { print $1} ' | xargs -L1 qmod -cj yuvipanda@tools-bastion-01.eqiad.wmflabs...
[20:44:18] bd808: +1. grrrit-wm and nagf didn't croak
[20:44:27] PAWS croaked because it has an explicit NFS dependency (needs shared folders)
[21:23:51] YuviPanda: as an aside, I feel phabricator tasks may be a more effective way to communicate than the SAL. It's easier to provide context, and afaik the entire team gets cc'ed on new tool-labs tasks
[22:18:42] 6Labs, 7Shinken: Labs Shinken complains about no more existing host integration-t102459 is DOWN - https://phabricator.wikimedia.org/T121767#1888901 (10hashar) The IRC bot `testing-shinken-` does report the same error. I kicked it out of #wikimedia-releng .
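For reference, the qstat/qmod commands quoted above could be wrapped into the kind of error-state check discussed for toolschecker. A minimal sketch, assuming (Son of) Grid Engine's default qstat output as seen on tools-bastion-01; this is not an existing toolschecker check, and the error handling is illustrative.

```bash
#!/bin/bash
# Sketch: report grid jobs stuck in error state, with their error reasons.
set -u

# Column 5 of plain qstat output is the job state; 'E' marks error state.
error_jobs=$(qstat -u '*' | awk '$5 ~ /E/ { print $1 }')

if [ -z "$error_jobs" ]; then
    echo "OK: no jobs in error state"
    exit 0
fi

for job in $error_jobs; do
    # "error reason" lines explain why the job entered E state
    # (e.g. NFS or LDAP failures at spawn time, as in the log above).
    echo "Job $job:"
    qstat -j "$job" | grep -i 'error reason'
done
# Clearing the state afterwards would be: qmod -cj <jobid>
exit 1
```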
[22:28:43] valhallasw`cloud: or the labs list :-}
[22:31:10] heh, yes, or that :-) but most of the things in the SAL are probably not too relevant there :-)
[23:44:26] 6Labs, 10Wikimedia-Stream: Provide useful diffs to high-volume consumers of RCStream - https://phabricator.wikimedia.org/T100082#1889300 (10DarTar)