[00:09:35] PROBLEM - ToolLabs: Excess CPU check: iowait on labmon1001 is CRITICAL: CRITICAL: tools.tools-exec-10.cpu.total.iowait.value (66.67%)
[00:23:45] PROBLEM - ToolLabs: Excess CPU check: iowait on labmon1001 is CRITICAL: CRITICAL: tools.tools-exec-10.cpu.total.iowait.value (50.00%)
[00:29:54] RECOVERY - ToolLabs: Excess CPU check: iowait on labmon1001 is OK: OK: All targets OK
[10:59:10] PROBLEM - ToolLabs: Excess CPU check: user on labmon1001 is CRITICAL: CRITICAL: tools.tools-exec-12.cpu.total.user.value (77.78%)
[10:59:39] yay
[10:59:40] nice
[11:00:05] !log tools tested CPU monitoring on tools-exec-12 by running stress, seems to work
[11:00:07] Logged the message, Master
[11:18:32] Hmm, "SELECT COUNT(1) FROM table" takes 8 hours? :/ (15m rows)
[11:40:35] Coren: Around?
[11:47:20] RECOVERY - ToolLabs: Excess CPU check: user on labmon1001 is OK: OK: All targets OK
[12:17:04] a930913: Am now. What's up?
[12:22:25] a930913: 't' state means your job was suspended because it hit some threshold; which would surprise me a little because I don't think I've configured any. What job id is this?
[12:30:03] Coren: I think I killed it.
[12:30:10] isn't it T for (T)hreshold and t is job is about to run (t)ransferring
[12:30:25] Coren: Currently got about 10 jobs on exec 10.
[12:30:38] All the other jobs on other nodes finished last night.
[12:31:22] phe: Yes, 'T' not 't'. I assumed that's what a930913 meant because our grid doesn't do transferable jobs. :-)
[12:32:12] a930913: -exec-10 is in pain indeed.
[12:32:26] * Coren looks into it.
[12:38:48] Coren: I tried to reschedule them but it said the jobs are not rerunable. Presumably for no continuous flag?
[12:38:50] Poor box is trashing like crazy.
[12:38:58] I thought as much :(
[12:39:39] hmm
[12:39:55] a930913: Yeah, no continuous means not rerunable automatically - that obviously doesn't prevent just running them again.
[12:40:02] Problem is, I can't see which parameter the job was run with.
[12:40:26] Why not? qtstat should tell you.
[12:40:40] interesting, I can't get a working shell on -exec-10
[12:40:40] Coren: What flags?
[12:40:46] i can now
[12:41:05] Coren: was it thrashing in the traditional sense (oom, swap)
[12:41:16] YuviPanda: yes; it's hitting swap hard.
[12:41:21] hmm
[12:41:33] I suppose the proposed memory check would've caught it
[12:41:57] YuviPanda: Yeah, I had 1G memory for each process.
[12:42:21] Coren: What flags to find the parameters?
[12:42:24] a930913: qstat -j ; look for the job_args line
[12:43:08] * Coren wonders how the box even got in that state to begin with.
[12:43:20] a930913: Did you schedule all those jobs at once?
[12:43:48] Coren: Yeah, a script dumped about 500 of these into the queue.
[12:44:23] Ah. That's what must have happened then; the scheduler had a lot of room on -10; and didn't realize that starting them all at once would prevent it from seeing how heavy they are. Silly scheduler.
[12:45:05] Ok, jobs deleted.
[12:45:12] Manual reschedule in process :p
[12:45:40] * Coren isn't sure how that can be prevented.
[12:45:55] a930913: pace queuing them up a bit; give -10 a chance to recover.
[12:46:51] a930913: Oh, ouch. That was it. Those jobs took over 7G of /physical/ ram.
[12:47:01] Coren: Each one was a gig.
[12:47:32] Yeah, the scheduler really shouldn't have tried to start them all at once on the same box. If it hadn
[12:47:40] hadn't, it would have noticed the load.
[12:48:57] Coren: How many RAMs do the nodes have?
[12:48:59] * Coren ponders.
[12:49:20] a930913: 8G for most.
[12:50:06] I'll need to tweak the startup load bias.
[12:50:37] Coren: Should I wait for that before submitting the ten or so again?
[12:51:08] a930913: If you wait 30s between submissions it should be okay regardless.
[12:57:36] YuviPanda: Looks like the backup thing works but I'm going to turn off the "back up home" by default. Some users' homes are... unreasonably full.
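For reference, the 't'/'T' confusion above matches gridengine's documented `qstat` state codes: lowercase 't' marks a job being transferred to an execution host, uppercase 'T' a job suspended because a queue threshold was tripped. A minimal lookup sketch (the dict and helper are illustrative, not any real gridengine API; the codes themselves are the commonly documented ones):

```python
# Common gridengine job state codes as printed by qstat. As discussed
# above, lowercase 't' and uppercase 'T' mean very different things.
SGE_STATES = {
    "qw": "queued, waiting to be scheduled",
    "r": "running",
    "t": "transferring to an execution host",
    "T": "suspended because a queue threshold was reached",
    "s": "suspended by the user",
    "S": "suspended because its queue is suspended",
    "Eqw": "error state; held in the queue",
    "dr": "being deleted while running",
}

def describe_state(code: str) -> str:
    """Return a human-readable description for a qstat state code."""
    return SGE_STATES.get(code, "unknown state code: " + code)
```

On a real grid, `qstat -j <jobid>` (as Coren notes later) shows the full job detail, including the `job_args` line with the submission parameters.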
[12:58:43] Coren: So the reason I was using so much RAM, was because it was faster to take the INFILE onto each process and use it locally, than polling the database. Is this silly?
[13:00:10] a930913: Not necessarily, and that's going to be much less painful on the DB (which is not a bad thing), but you may want to reduce how many you do in parallel then - since it's faster to do you don't loose much by doing so.
[13:01:21] Coren: heh
[13:01:23] Coren: are you backing up tool homes?
[13:01:26] Or, at least, pace the rythm at which you schedule them.
[13:01:36] YuviPanda: Yes, by default, though it didn't reach that yet.
[13:01:43] right
[13:02:47] Coren: I have about 500 files to process, so I was dumping them into the scheduler on the basis that you said the scheduler would schedule, but seeing as it doesn't, I'll add some sleepytime in :p
[13:03:11] It schedules; it's just a little overoptimistic. :-)
[13:05:15] :)
[13:14:57] !log shinken added yuvipanda to project to mess around
[13:14:59] Logged the message, Master
[13:25:59] !log analytics stopping mysql and puppet on wikimetrics1 to perform /srv surgery
[13:26:01] Logged the message, Master
[13:29:55] !log analytics enable /srv role on wikimetrics1, running puppet
[13:29:57] Logged the message, Master
[13:30:35] ooh, surgery
[13:30:51] !log analytics kill old diamond archive logs on wikimetrics1 so /var has enough space for puppet to run
[13:30:52] Logged the message, Master
[13:32:49] !log analytics kill more logs on wikimetrics1 so apt-get has enough space to run
[13:32:51] Logged the message, Master
[13:50:40] !log analytics fix apparmor + my.cnf to refer to new datadir
[13:50:42] Logged the message, Master
[13:56:02] 3Wikimedia Labs / 3deployment-prep (beta): hhvm fill up /var/log/upstart/hhvm.log - 10https://bugzilla.wikimedia.org/69976#c3 (10Antoine "hashar" Musso) I think puppet is now passing on the hhvm instances. I am not sure where the log are written to though.
[14:34:08] andrewbogott: ping?
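The exchange above boils down to two rules of thumb: the combined footprint of jobs started simultaneously on one node must fit in physical RAM (here, ~1G jobs on 8G nodes), and submissions should be paced (~30s apart) so the scheduler can observe the load from earlier jobs before placing the next. A sketch under those assumptions (function names and the 1G overhead figure are illustrative; the RAM numbers come from the log):

```python
import time

def max_parallel_jobs(node_ram_gb: float, job_ram_gb: float,
                      os_overhead_gb: float = 1.0) -> int:
    """How many jobs of a given footprint fit in physical RAM,
    leaving some headroom for the OS. With 8G nodes and 1G jobs this
    comes out to 7 - which is why a burst of such jobs landing on
    tools-exec-10 at once pushed it into swap."""
    return max(0, int((node_ram_gb - os_overhead_gb) // job_ram_gb))

def pace_submissions(jobs, submit, delay_seconds=30):
    """Submit jobs one at a time with a delay between each, giving
    the scheduler time to notice the load of earlier submissions
    (the 'sleepytime' a930913 adds above). `submit` would wrap qsub
    or jsub in practice."""
    for i, job in enumerate(jobs):
        if i:
            time.sleep(delay_seconds)
        submit(job)
```

With 500 input files this turns a single burst into a trickle the scheduler can actually account for.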
[14:34:59] Hrm. Puppetmaster seems broken for ~1000 minutes
[14:45:56] andrewbogott: Coren I can't log into newly created instances, with public key denied. NFS acting up?
[14:46:31] YuviPanda: there seems to be an issue with the puppetmaster atm, seemingly with the web service proper. Puppet runs won't work.
[14:46:44] ah, that'd explain things
[15:02:22] Coren: Can I safely migrate tools-exec-03? All grid jobs moved off now?
[15:03:14] andrewbogott: Now, because puppet can't run on the new nodes (or on any instance atm). I didn't want to mess with the puppetmaster because the problem seems to be web-server-side and I know you guys have been messing with it.
[15:03:23] s/Now,/No,/
[15:03:36] ok...
[15:03:37] andrewbogott: That was the reason for my earlier ping.
[15:03:51] It's the virt1000 puppet master?
[15:03:58] * andrewbogott looks
[15:04:10] andrewbogott: Looks like. Puppet gets 400s when it tries to fetch its calalogs.
[15:05:07] So, broke 18 hours ago… curious
[15:22:37] andrewbogott: any update on the puppetmaster?
[15:22:44] Still looking
[15:23:10] cool
[15:27:15] 3Wikimedia Labs / 3deployment-prep (beta): Setup monitoring for Beta cluster - 10https://bugzilla.wikimedia.org/51497 (10Antoine "hashar" Musso)
[15:29:25] !log analytics cherry-picked https://gerrit.wikimedia.org/r/#/c/160464 on wikimetrics1
[15:29:27] Logged the message, Master
[15:43:44] 3Wikimedia Labs / 3tools: Expose revision.rev_content_format on replicated wikidatawiki - 10https://bugzilla.wikimedia.org/54164#c6 (10Marc A. Pelletier) Now that this schema change has been propagated it's possible to do cleanly.
[15:53:04] For folks following along at home: The labs puppet master is failing. The issue is diagnosed and a few solutions are in discussion, a fix should be in place shortly.
[16:35:23] andrewbogott: if puppetmaster is cleared up, can you log in to shinken-01 with your root key and force a run?
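The `s/Now,/No,/` message above is the common IRC shorthand for a sed-style correction of one's own previous line. Applied literally, it is a single regex substitution (the string here is a shortened copy of the message being corrected):

```python
import re

previous = "andrewbogott: Now, because puppet can't run on the new nodes"
# Apply the sed-style correction s/Now,/No,/ from the log:
# replace the first occurrence of "Now," with "No,".
corrected = re.sub(r"Now,", "No,", previous, count=1)
# corrected == "andrewbogott: No, because puppet can't run on the new nodes"
```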
[16:35:44] YuviPanda: I'm not totally happy with virt1000 yet, but I'll try to remember to do that next
[16:35:48] ah ok
[16:42:09] YuviPanda: I can't get into shinken-01 but I bet if you reboot it it'll cheer up
[16:42:18] let me reboot
[16:42:34] And virt1000 is still hanging during puppet run :(
[16:42:39] rebooted
[16:44:01] Coren: What does tools-db look like atm?
[17:05:25] a930913: I've never actually seen it, but I'm pretty sure it looks like a server in a rack. It probably has blinkenlights too. :-)
[17:07:01] It's busy, unsurprisingly. That thing is rarely idle. Whyfor?
[17:09:56] ok, labs puppet issues should be resolved.
[17:10:11] * Coren tries
[17:10:27] I can't say the same for the rest of the prod cluster :(
[17:10:46] Puppet can haz katalog!
[17:11:35] YuviPanda: is shinken-01 happy now?
[17:11:42] andrewbogott: trying
[17:12:01] andrewbogott: seems to be.
[17:12:02] thanks
[17:12:10] cool
[17:13:00] Coren: I have two queries running on it, for 3 and 6 hours. The 3 hour one I was expecting to finish over an hour ago.
[17:13:28] a930913: It's running pretty much at 100% use atm; so you're sharing resources with everyone else.
[17:15:00] Coren: :(
[17:16:34] andrewbogott: With puppet in place, I replaced both exec nodes with the newer ones. I'll give it a bit for non-continuous jobs to drain then relocate the continuous ones.
[17:19:05] andrewbogott: tools-submit, otoh, is a bit more precious. I'm doing a backup of the important things now.
[17:19:41] Coren: why the switching over?
[17:20:01] YuviPanda: One of the virt* hosts is suspect; we want to reimage it.
[17:20:08] ah
[17:20:12] so moving things off it
[17:20:15] cool
[17:20:16] * Coren nods.
[17:44:11] Coren: Warning: mysqli::query(): Empty query in /data/project/quentinv57-tools/public_html/tools/sulinfo.php on line 207
[17:44:11] SUL info
[17:44:19] can you please help with this ?
[17:44:33] it was working eaerlier today
[18:07:30] 3Wikimedia Labs / 3wikitech-interface: [Regression] WMFLabs: Nova project quota broken - 10https://bugzilla.wikimedia.org/70634#c3 (10Krinkle) 5NEW>3RESO/WOR I can't tell for certain as I did a hard browser reset. The new login session shows none of these bugs.
[18:07:30] 3Wikimedia Labs / 3wikitech-interface: [Regression] WMFLabs: Unable to delete any instance - 10https://bugzilla.wikimedia.org/70636 (10Krinkle) 5UNCO>3RESO/WOR
[18:08:36] YuviPanda: Is graphite labs still maintained?
[18:08:41] I dont see "integration" in the list
[18:08:50] Krinkle: graphite.wmflabs.org should have everything
[18:09:08] Antoine gave our Jenkins slaves a dedicated puppet master though
[18:09:09] Krinkle: hmm, is puppet enabled there/
[18:09:10] ?
[18:09:11] maybe I need to backport something?
[18:09:12] ah
[18:09:14] yup
[18:09:20] I suggest git pull -r origin production?
[18:09:24] there have been quite a few changes
[18:12:40] YuviPanda: last update was 4 weeks ago
[18:17:39] Krinkle: might've missed some, I think
[18:20:39] YuviPanda: Hm.. I see lots of /etc/ changes in the run I did after rebasing
[18:20:53] diamond might be one
[18:20:55] nscd.conf
[18:21:10] and ensure packages / ordered_json stuff
[18:21:26] Error: /Stage[main]/Role::Labs::Instance/Mount[/public/dumps]: Failed to call refresh: Execution of '/bin/mount -o remount /public/dumps' returned 32:
[18:21:26] Error: /Stage[main]/Role::Labs::Instance/Mount[/public/dumps]: Execution of '/bin/mount -o remount /public/dumps' returned 32:
[18:21:32] Krinkle: ignorable error
[18:21:48] /etc/salt/minion
[18:25:49] /etc/ferm/conf.d/00_defs
[18:30:46] !log integration Delete the experimental integration-slave1005 instance
[18:30:50] Logged the message, Master
[18:34:12] !log integration Create and set up pool of Jenkins slaves with Trusty (integration-slave1006, integration-slave1007, integration-slave1008); bug 68256
[18:34:15] Logged the message, Master
[18:40:57] andrewbogott: FYI: tools-exec-{03,07} are now off-queue and empty. Feel free to migrate them.
[18:41:12] Coren: Great! I will start breaking things post-meeting
[18:44:28] Coren: More wild exec nodes appears?
[18:44:57] a930913: Two slightly bigger ones to replace the ones we are about to move away. They *should* survive, but we wanted to avoid downtime and/or risk.
[18:48:07] andrewbogott: can you create a 'shinken' user for me in LDAP?
[18:48:24] Coren: ^
[18:48:33] YuviPanda: yes, remind me post-meeting
[18:48:41] andrewbogott: will do
[18:50:26] Krinkle: I see integration machines on graphite now
[18:58:01] 3Wikimedia Labs / 3deployment-prep (beta): ferm policy on deployment-bastion prevents scap rsync from mw hosts - 10https://bugzilla.wikimedia.org/70858 (10Antoine "hashar" Musso) 3NEW p:3Unprio s:3normal a:3None The deployment-bastion.eqiad.wmflabs has ferm enabled. A changed occurred at 17:20 UTC w...
[18:59:29] hashar: do you need help with ferm ?
[19:00:15] 3Wikimedia Labs / 3deployment-prep (beta): ferm policy on deployment-bastion prevents scap rsync from mw hosts - 10https://bugzilla.wikimedia.org/70858 (10Greg Grossmeier) p:5Unprio>3Highes
[19:00:49] * YuviPanda gives andrewbogott a post meeting poke
[19:00:56] matanya: sure :)
[19:01:13] hashar: hit me! :)
[19:01:14] matanya: the mediawki instance of the beta cluster can no more ssh to the central deployment-bastion instance
[19:01:40] matanya: which is needed for scap (the deployment tool). But I have no idea whether we allowed such access in the first place, might have been hacked manually
[19:01:44] that looks like the change i abadoned ... :D
[19:01:54] iirc
[19:01:58] let me look into this
[19:02:07] or maybe some ferm::rule is no more being applied
[19:02:24] as you know, i don't have access to the hosts
[19:02:31] but i can look on puppet
[19:03:08] oh you can get access on them :-]
[19:03:12] root would need a NDA though
[19:04:09] hashar: i have nda
[19:05:25] hashar: Did you do anything with the new 1006-8 instances?
[19:05:30] hashar: you can check him on LDAP actually :)
[19:05:31] Krinkle: nothing
[19:06:37] hashar: k. strange.
[19:06:54] hashar: integration 6 and 7 have a puppet ca error (as is normal). but integartion 8 does not
[19:07:01] https://wikitech.wikimedia.org/wiki/Nova_Resource:Integration/Setup
[19:07:13] That worries me
[19:07:26] maybe the puppetmaster change didn't apply? How can I see what the puppetmaster is?
[19:09:02] matanya: will look at it tomorrow. child care going on :p
[19:09:41] Krinkle: maybe 8 does not loint to our puppetmaster
[19:09:51] hashar: sure, production issues for me, ttyl
[19:10:14] Hm.. took a few minutes longer
[19:10:17] it shows up in ca list now
[19:10:19] I've signed in
[19:10:22] Krinkle: we can pair tomorrow if you want. busy with kid :p
[19:10:24] OK. I'll continue withi my checklist
[19:11:00] out sorry
[19:12:39] YuviPanda: user shinken should work now
[19:14:01] Coren: Is there any way we can see these usage stats?
[19:26:23] a930913: ... of?
[19:28:00] Coren: Such as the 100% use of the db.
[19:31:55] a930913: Not trivially. That DB is real metal and the stats are on graphite.wikimedia.org - and I'm pretty sure you need an NDA to get there.
[19:34:10] http://noc.wikimedia.org/dbtree/ ?
[19:37:02] Coren: Can't the nonconfidential stuff be copied somewhere accessable?
[19:43:10] Krinkle: more or less around :)
[19:48:27] mutante: Which one is tools-db?
[19:53:09] a930913: mutante: the labs replicas aren't on there (maybe they should be, actually)
[19:55:02] Coren: having labsdb on graphite.wmflabs.org would be nice
[19:55:11] no need to change firewall groups either
[19:55:16] just need to add an extra diamond handler
[19:57:20] andrewbogott: ty for the user
[19:57:41] Coren: a930913 if you want labsdb stats on graphite.wmflabs.org, do put setup a bugzilla bug, and assign it to me
[19:59:16] andrewbogott: hmm, if we add a user in ldap, wouldn't it automatically come up in instances?
[19:59:21] or do I still need to add a system user?
[19:59:59] YuviPanda: It's not in any projects, so won't appear as a member of anything. Generally you'd want to create a system user I think.
[20:00:08] hmm, ok
[20:00:14] I'm not sure how this is handled generally, but if you look at 'people' in ldap you can find some examples.
[20:00:27] I wonder how it was handled for labs-vagrant
[20:01:12] YuviPanda: I don't have a bug account, I bug people here :p
[20:01:44] 3Wikimedia Labs / 3deployment-prep (beta): Search and page loads extremely slow on beta cluster (cause being investigated) - 10https://bugzilla.wikimedia.org/70103#c9 (10Rummana Yasmeen) 5RESO/FIX>3REOP This is again happening.
[20:03:17] andrewbogott: yup, it's automatically created. shinken's install is broken, let me figure out...
[20:03:44] 3Wikimedia Labs / 3deployment-prep (beta): Search and page loads extremely slow on beta cluster (cause being investigated) - 10https://bugzilla.wikimedia.org/70103 (10Rummana Yasmeen) 5REOP>3ASSI
[20:04:12] YuviPanda: know anything about the 'gerrit-dev' instance?
[20:04:19] andrewbogott: nope
[20:04:25] Looks like it was a victim of virt1006, in a 'deleting' state but still running, sort of
[20:04:43] quarry-web-test is mine, and I'd rather be around when you move it (I'll have to go now in a bit tho)
[20:04:47] hm… qchris, 'gerrit-dev'? That yours? Or ^d?
[20:06:19] andrewbogott: Not sure. If it's mine, I do not use it any longer.
[20:08:08] andrewbogott: do you need to add the default group for that user specifically in LDAP?
[20:08:13] or does adding a user autocreate a group?
[20:08:34] YuviPanda: pretty sure the default group is there automatically
[20:09:04] andrewbogott: hmm, how do I check?
[20:09:12] the shinken install script is complaining that there's no such group
[20:09:29] 3Wikimedia Labs / 3deployment-prep (beta): Setup monitoring for Beta cluster - 10https://bugzilla.wikimedia.org/51497 (10Antoine "hashar" Musso)
[20:09:33] 3Wikimedia Labs / 3deployment-prep (beta): monitor unsigned salt keys - 10https://bugzilla.wikimedia.org/70862 (10Antoine "hashar" Musso) 3NEW p:3Unprio s:3normal a:3None Whenever an instance is added to the beta cluster and switched to the local salt master, we might forget to sign the key on the sa...
[20:09:46] YuviPanda: oh, if you're creating the system user then I think you also need to create a group.
[20:09:54] andrewbogott: I didn't create the system user
[20:09:54] But I have no idea, really, need to find an example in puppet
[20:09:56] it was already there
[20:10:08] andrewbogott: I checked, and just creating it in ldap automagically got it there.
[20:10:14] hm
[20:10:17] it's the group that is missing
[20:10:20] are groups also managed in LDAP?
[20:12:32] Yes
[20:12:39] I may need to create a group, lemme see if I can
[20:12:50] ok
[20:15:19] YuviPanda: is that any better?
[20:15:31] checking
[20:15:48] andrewbogott: yup, ty
[20:16:23] bd808: I need to reboot bastion2, do you mind closing your sessions there?
[20:16:46] 3Wikimedia Labs / 3deployment-prep (beta): salt commands exceed timeout - 10https://bugzilla.wikimedia.org/70863 (10Antoine "hashar" Musso)
[20:16:46] 3Wikimedia Labs / 3deployment-prep (beta): ferm policy on deployment-bastion prevents scap rsync from mw hosts - 10https://bugzilla.wikimedia.org/70858 (10Antoine "hashar" Musso)
[20:16:46] 3Wikimedia Labs / 3deployment-prep (beta): salt commands exceed timeout - 10https://bugzilla.wikimedia.org/70863 (10Antoine "hashar" Musso) 3NEW p:3Unprio s:3normal a:3None deployment-bastion has a bunch of idling python commands. When running puppet I noticed some timeout being exceeded: Error: /St...
[20:17:01] andrewbogott: Just reboot it. I'm not doing anything important through there right now
[20:17:13] bd808: ok, thank you
[20:17:28] alright, am off. more shinkening tomorrow
[20:17:30] * YuviPanda waves
[20:39:07] !log phabricator - created project, added members qgil,aklapper,yuvipanda,dzahn
[20:39:09] Logged the message, Master
[21:00:22] andrewbogott: You are doing -03?
[21:00:45] Coren: -03 didn't survive the operation I'm afraid :(
[21:01:07] I may be able to revive it, let me try
[21:01:18] andrewbogott: How sad. Did its sacrifice give you enough data to make the others live?
[21:01:27] It did! 07 seems fine.
[21:01:56] Oh, 07 is moved? It does indeed seem fine: I didn't know you had even started. :-)
[21:02:44] 07 is on virt1005 now
[21:02:59] I'm trying 03 again, we'll see I can get the pieces back together
[21:03:10] andrewbogott: Don't spend more than a couple minutes trying to revive -03; if it's not easy I can just rebuild it.
[21:03:48] Want me to prep -submit for the move? I want to back up a thing or two off it first just in case.
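The missing-group check discussed above can be done with `getent group <name>`, which consults LDAP as well as `/etc/group` via nsswitch and exits nonzero when the group is absent. A small sketch that parses `getent group`-style output (the sample data is made up for illustration; `shinken:x:998:` is not a real entry):

```python
def group_exists(getent_output: str, name: str) -> bool:
    """Check whether a group appears in `getent group`-style output
    (colon-separated fields: name:passwd:gid:members)."""
    for line in getent_output.splitlines():
        fields = line.split(":")
        if fields and fields[0] == name:
            return True
    return False

# Hypothetical sample; on a real instance you would feed in the
# output of `getent group`.
sample = "root:x:0:\nshinken:x:998:\n"
```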
[21:04:47] * Coren puts -07 back in the queue.
[21:06:51] andrewbogott: -submit is ready to be moved whenever you feel like it.
[21:10:10] Coren: I think -03 is back now...
[21:10:24] andrewbogott: Look like!
[21:10:27] * Coren applauds.
[21:10:40] I moved 07 and 03 both to virt1005. Is it ok if -submit goes there too, or is that too many eggs in a basket?
[21:11:19] andrewbogott: Should be okay if there is room. -submit in particular is very lightweight. There are 15 exec nodes now, having a couple together is not an issue.
[21:11:40] ok. Ready for 10 min downtime on -submit?
[21:11:45] * Coren nods.
[21:12:46] 3Wikimedia Labs / 3deployment-prep (beta): sudo rights for matanya - 10https://bugzilla.wikimedia.org/70864 (10matanya) 3NEW p:3Unprio s:3normal a:3None Hello I would like to get sudo rights on beta project in order to help debug puppet/firewall and other issues. I have signed an NDA, can verify wit...
[21:15:29] 3Wikimedia Labs / 3deployment-prep (beta): sudo rights for matanya - 10https://bugzilla.wikimedia.org/70864#c1 (10jeremyb) no need for ops: $ groups matanya | perl -lpe 's/\s+/\n/g;' | fgrep -x -e project-deployment-prep -e nda project-deployment-prep nda
[21:19:34] !log deployment-prep Added Matanya to under_NDA sudoers group (bug 70864)
[21:19:38] Logged the message, Master
[21:19:45] 3Wikimedia Labs / 3deployment-prep (beta): sudo rights for matanya - 10https://bugzilla.wikimedia.org/70864#c2 (10Bryan Davis) 5NEW>3RESO/FIX p:5Unprio>3Normal a:3Bryan Davis $ ldapsearch -x cn="Matanya" \* + |grep nda isMemberOf: cn=nda,ou=groups,dc=wikimedia,dc=org Easy decision. Welcome to the...
[21:19:46] matanya: ^
[21:19:51] Thanks!
[21:21:43] Coren: ok, you can repool tools-submit now.
[21:22:08] Only 60 instances left to go!
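jeremyb's one-liner in the bug above splits the space-separated output of `groups` and requires both `project-deployment-prep` and `nda` to be present (`fgrep -x` matches whole lines only). The same check in Python, over a sample string that mirrors the log but is otherwise illustrative:

```python
def has_required_groups(groups_output: str,
                        required=("project-deployment-prep", "nda")) -> bool:
    """Equivalent of `groups user | perl -lpe 's/\\s+/\\n/g' |
    fgrep -x ...`: split the whitespace-separated group list and
    require an exact match for every needed group."""
    member_of = set(groups_output.split())
    return all(g in member_of for g in required)

# Illustrative sample of `groups` output for a user in both groups.
sample = "matanya wikidev project-deployment-prep nda"
```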
[21:25:14] 3Wikimedia Labs / 3deployment-prep (beta): Search and page loads extremely slow on beta cluster (cause being investigated) - 10https://bugzilla.wikimedia.org/70103#c10 (10Rummana Yasmeen) Its slow while searching for pages, link target names, media files etc but page loading is fine at my end.
[21:29:35] andrewbogott: tools-submit isn't an exec node, all it does is run crond.
[21:29:56] andrewbogott: Which is why the only safety measure I took against you breaking it is back the crontabs up. :-)
[21:30:09] !log phabricator - created instance, created puppet group, added role::phabricator::labs to new group
[21:30:12] Logged the message, Master
[21:30:12] It also has a login message saying that it's about to be shut down :)
[21:30:29] YuviPanda|zzzz: ^ eh, dunno how anyone used that before when there was not even a group for it
[21:30:46] andrewbogott: Yeah, I just removed the /etc/nologin
[21:30:49] (hint: i think we didnt"P)
[21:40:22] !log deployment-prep *skipped* deploy of OCG, due to deployment-salt issues
[21:40:26] Logged the message, Master
[21:41:34] !log deployment-prep migrating deployment-sentry2 to virt1002
[21:41:37] Logged the message, dummy
[21:44:04] !log deployment-prep migrating deployment-videoscaler01 to virt1002
[21:44:07] Logged the message, dummy
[21:46:32] 3Wikimedia Labs / 3deployment-prep (beta): deployment-salt can't talk to itself, git deploy hangs - 10https://bugzilla.wikimedia.org/70868 (10C. Scott Ananian) 3NEW p:3Unprio s:3normal a:3None cscott-free: git deploy sync is hanging (no console output at all) on beta. bd808: lame. I wonder if salt is...
[21:48:31] Danny_B: I need to relocate your two instances, dannyb and dannyb_large. Can you confirm that they're safe to reboot?
[21:48:59] 3Wikimedia Labs / 3deployment-prep (beta): deployment-salt can't talk to itself, git deploy hangs - 10https://bugzilla.wikimedia.org/70868#c1 (10Bryan Davis) On deployment-salt: $ salt '*' cmd.run hostname i-00000396.eqiad.wmflabs: deployment-pdf01 i-00000504.eqiad.wmflabs: deployment-math...
[21:50:29] 3Wikimedia Labs / 3deployment-prep (beta): deployment-salt can't talk to itself, git deploy hangs - 10https://bugzilla.wikimedia.org/70868#c2 (10Bryan Davis) I tried restarting the salt-master process on deployment-salt and the salt-minion process on deployment-bastion and this didn't seem to help anything.
[22:16:13] !log phabricator - configured instace phab-01 to use role::phabricator::labs
[22:16:16] Logged the message, Master
[22:24:08] Nikerabbit: I'd like to reboot some instances in the 'language' project: language-browsertests, language-lcmd, language-mleb-legacy. Any thoughts? Safe?
[22:24:44] 3Wikimedia Labs / 3deployment-prep (beta): Security test load caused search and page loads extremely slow on beta cluster - 10https://bugzilla.wikimedia.org/70103#c11 (10Greg Grossmeier) 5ASSI>3RESO/FIX I just sat next to Rummana to see the symptoms. There a bit sporatic but noticable. I'll open a new b...
[22:26:31] 3Wikimedia Labs / 3deployment-prep (beta): Search is sometimes slow - 10https://bugzilla.wikimedia.org/70869 (10Greg Grossmeier) 3NEW p:3Unprio s:3normal a:3None Rummana saw the issue described in bug 70103 again. The search requests (either in the drop down on the top right, or within VE) are somet
[22:26:59] 3Wikimedia Labs / 3deployment-prep (beta): Search is sometimes slow - 10https://bugzilla.wikimedia.org/70869#c1 (10Greg Grossmeier) p:5Unprio>3Normal (setting normal for now, but if it starts causing browser test failures or otherwise, we'll bump it up)
[22:28:59] 3Wikimedia Labs / 3deployment-prep (beta): Search is sometimes slow - 10https://bugzilla.wikimedia.org/70869#c2 (10Greg Grossmeier) Better graph: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1410820101.127&from=-6hours&target=deployment-prep.deployment-elastic01.loadavg.01.value&target=depl...
[22:31:15] 3Wikimedia Labs / 3deployment-prep (beta): Search is sometimes slow - 10https://bugzilla.wikimedia.org/70869#c3 (10Greg Grossmeier) Bah, those graphs are relative time based (last 6 hours) and will change. Here's a static one for today from 17:30 - 23:30 UTC: http://graphite.wmflabs.org/render/?width=586&hei...
[22:40:40] !log integration migrating integration-slave1002 and integration-slave1007 to virt1002
[22:40:43] Logged the message, dummy
[22:59:38] andrewbogott: Hm.. normalisation? https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:I-000001c1.eqiad.wmflabs&diff=prev&oldid=109194
[22:59:59] Even for one that was created earlier today https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource%3AI-000005cd.eqiad.wmflabs&diff=127095&oldid=127015
[23:00:20] Did that update just then or is it two separate processes writing it in a different way (so it always happens)
[23:01:19] I just moved them to a new host.
[23:01:23] That causes a status edit
[23:02:00] Yeah, but the status updator seems to read info in a different format causing a dirty diff
[23:02:07] andrewbogott: Is the move done?
[23:02:23] 1002 is still being copied. Should be done shortly
[23:02:48] Does the move persist and running processes? Or does it kill and reboot?
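Greg's point in the bug comments above is that a render URL with a relative range like `from=-6hours` produces a graph that drifts over time, while absolute `from`/`until` values (graphite accepts the `HH:MM_YYYYMMDD` form) freeze it for a permanent link. A helper that builds such URLs (the function itself is just a sketch; `width`, `height`, `from`, `until`, and repeated `target` are real graphite render parameters):

```python
from urllib.parse import urlencode

def render_url(base, targets, from_, until=None, width=586, height=308):
    """Build a graphite /render URL. Absolute from/until values
    (e.g. 17:30_20140915) give a stable graph; relative values
    like -6hours change as time passes."""
    params = [("width", width), ("height", height), ("from", from_)]
    if until is not None:
        params.append(("until", until))
    # Graphite accepts the target parameter repeatedly, once per series.
    params.extend(("target", t) for t in targets)
    return base + "/render/?" + urlencode(params)

url = render_url(
    "http://graphite.wmflabs.org",
    ["deployment-prep.deployment-elastic01.loadavg.01.value"],
    from_="17:30_20140915",
    until="23:30_20140915",
)
```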
[23:02:55] reboot
[23:03:01] Both 1002 and 1007 unexpectedly started failing in Jenkins: https://integration.wikimedia.org/ci/computer/
[23:03:29] Sorry, I should've warned. I assumed (wrongly?) that jenkins would cope with a brief slave shortage.
[23:03:32] and the jenkins slave connection does not start at reboot, needs to be done manually. So don't move any more as that'll starve the pool if you don't repool them
[23:03:44] It coops with connection dropping
[23:03:45] They're the only ones I need to move.
[23:03:58] 1002 is rebooting now, 1007 should be up already
[23:04:41] andrewbogott: Can you try logging into https://integration.wikimedia.org/ci/computer/integration-slave1007/ and see if you have rights to launch the slave connection?
[23:05:13] into jenkins web UI that is (LDAP creds)
[23:07:05] Launching
[23:10:36] Krinkle: I don't know if it's your bag or not, but puppet is broken on all of those slaves.
[23:12:37] andrewbogott: yeah, though it should be continuing
[23:12:41] there's 2 known failures
[23:12:52] 3
[23:13:00] on all labs instances public-mount is broken (Yuvi told me that's OK)
[23:13:00] anyway, those two slaves are up and running now
[23:13:12] on integration slaves with trusty a php package is missing.
[23:13:50] which is why I've added the new trusty slaves only to the npm-test pool, not the phpunit pool.
[23:16:17] andrewbogott: thx, yeah, pool is moist again
[23:27:56] all the SUL related tools are failing >?