[00:37:44] 6Labs, 10Wikimedia-Labs-Infrastructure, 10Continuous-Integration, 10OOjs, 6operations: Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." - https://phabricator.wikimedia.org/T92351#1115471 (10scfc) I don't think so because that was merged earlier. But on March 6th https... [02:10:34] 6Labs, 10Wikimedia-Labs-Infrastructure, 10Continuous-Integration, 10OOjs, 6operations: Jenkins failing with "Error: GET https://saucelabs.com: Couldn't resolve host name." - https://phabricator.wikimedia.org/T92351#1115608 (10coren) The only net effect the change can make is that //iff// the fqdn has exa... [04:02:10] 10Tool-Labs, 10Living-Style-Guide, 6Mobile-Web: npm version on tools-login.wmflabs.org is incompatible with MobileFrontend package.json used by the KSS styleguide - https://phabricator.wikimedia.org/T89093#1115717 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Use tools-trusty.wmflabs.org for newer version... [06:37:02] PROBLEM - Puppet failure on tools-login is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [06:39:17] now what [06:39:57] aaaah, logrotate. [06:42:10] 6Labs, 10Beta-Cluster, 10Staging: Provide option to autosign puppet certs for self hosted puppetmasters - https://phabricator.wikimedia.org/T92606#1115838 (10yuvipanda) 3NEW [06:56:23] PROBLEM - Puppet failure on tools-exec-12 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [0.0] [07:01:51] Once again, https://tools.wmflabs.org/kmlexport/?project=de&article=Alzkanal&redir=bing is down. [07:03:44] sigh [07:03:56] ergtheee: so it gets restarted about thrice a day, and then it isn't [07:04:01] so it doesn’t get stuck in a restart loop [07:04:21] ergtheee: another problem is that it doesn’t seem to work with newer versions of… perl? [07:04:30] ergtheee: do you know who the maintainers are? 
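[Editor's note: the restart policy described at [07:03:56]–[07:04:01] (restart a crashed tool a few times a day, but never let it spin in a tight restart loop) can be sketched as a small guard. Everything below — the stamp-file format, the per-day limit, and the file path — is an invented illustration, not the actual Tool Labs mechanism:]

```shell
#!/bin/sh
# Hypothetical rate-limited restart guard: allow at most $2 restarts per
# calendar day, tracked in a small stamp file ($1). Returns 0 if a restart
# is allowed (and records it), 1 if the daily budget is already spent.
should_restart() {
    stamp_file="$1"; max_per_day="$2"
    today=$(date +%Y-%m-%d)
    read -r day count < "$stamp_file" 2>/dev/null || { day=""; count=0; }
    # A stamp from a previous day resets the counter.
    [ "$day" = "$today" ] || count=0
    [ "$count" -lt "$max_per_day" ] || return 1
    echo "$today $((count + 1))" > "$stamp_file"
}
```

[A cron job could then run something like `should_restart /tmp/kmlexport.stamp 3 && webservice restart` — tool name and path hypothetical.]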
[07:05:34] ergtheee: I’ve started it again [07:05:44] It is Para: https://wikitech.wikimedia.org/w/index.php?title=User_talk:Para&diff=146964&oldid=146961 [07:05:50] Thanks for restarting [07:06:53] ergtheee: I shall leave them a message [07:07:00] RECOVERY - Puppet failure on tools-login is OK: OK: Less than 1.00% above the threshold [0.0] [07:11:20] 6Labs, 5Patch-For-Review: New disk partition scheme for labs instances - https://phabricator.wikimedia.org/T87003#1115872 (10yuvipanda) [07:17:37] 10Tool-Labs-tools-Other, 7Tracking: [tracking] toolserver.org tools that have not been migrated - https://phabricator.wikimedia.org/T60865#1115881 (10TheDJ) [07:20:19] 10Tool-Labs-tools-Other, 7Tracking: [tracking] toolserver.org tools that have not been migrated - https://phabricator.wikimedia.org/T60865#1115886 (10TheDJ) [07:20:20] 10Tool-Labs-tools-Erwin's-tools: Migrate https://toolserver.org/~erwin85/blockfinder.php to Tool Labs - https://phabricator.wikimedia.org/T62880#1115883 (10TheDJ) 5Open>3Resolved a:3TheDJ I fixed this one a while ago. [07:21:22] RECOVERY - Puppet failure on tools-exec-12 is OK: OK: Less than 1.00% above the threshold [0.0] [09:30:05] 10Tool-Labs, 10Wikimania-Hackathon-2015: Conduct a Tool Labs Workshop in Wikimania hackathon - https://phabricator.wikimedia.org/T91061#1116030 (10Qgil) Only in Wikimania or also in the #Wikimedia-Hackathon-2015 ? See {T92274} [09:33:35] 6Labs, 10Wikimedia-Hackathon-2015: Organize Wikimedia Labs activities at the Wikimedia Hackathon 2015 - https://phabricator.wikimedia.org/T92274#1116032 (10Qgil) >>! In T92274#1104489, @Qgil wrote: > Can you nominate a main contact for the Wikimedia Labs area, please? We are doing the same for the other main a... [09:35:18] 6Labs, 10Wikimedia-Hackathon-2015: Organize Wikimedia Labs activities at the Wikimedia Hackathon 2015 - https://phabricator.wikimedia.org/T92274#1116035 (10yuvipanda) Hello! Apologies for the delayed response. 
@Coren @andrew are you ok with me volunteering to be 'main contact'? [09:36:55] 6Labs, 10Wikimedia-Hackathon-2015: Organize Wikimedia Labs activities at the Wikimedia Hackathon 2015 - https://phabricator.wikimedia.org/T92274#1116036 (10yuvipanda) >>! In T92274#1116032, @Qgil wrote: >> Also, I wonder whether we could mobilize the #Tool-Labs community, and whether they would be part of this... [09:38:12] 10Tool-Labs, 10Wikimania-Hackathon-2015: Conduct a Tool Labs Workshop in Wikimania hackathon - https://phabricator.wikimedia.org/T91061#1116039 (10yuvipanda) Yes, I think we should do both. Wikimania attracts a more diverse set of users (a lot more 'end users!') and I think a workshop here would be of a differ... [10:19:25] (03PS1) 10Yuvipanda: Add more projects for devtools and mobile [labs/tools/wikibugs2] - 10https://gerrit.wikimedia.org/r/196551 [11:15:59] PROBLEM - Free space - all mounts on tools-trusty is CRITICAL: CRITICAL: tools.tools-trusty.diskspace.root.byte_percentfree.value (<22.22%) [11:24:27] Coren, Cc: YuviPanda: getting labstore1001 port saturation alerts; labstore1001 doesn't seem to be in Ganglia, again... *sigh* [11:24:32] Coren, Cc: YuviPanda: https://ganglia.wikimedia.org/latest/graph.php?r=year&z=xlarge&h=labstore1001.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=network_report&c=Labs+NFS+cluster+eqiad [11:25:08] sigh [11:25:11] there’s graphite data, I think [11:25:12] * YuviPanda looks [11:25:27] there is [11:25:32] I guess it’s the unpuppetized bond0 [11:25:43] https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1426245937.7&target=servers.labstore1001.network.bond0.rx_byte.value&target=servers.labstore1001.network.bond0.tx_byte.value [11:25:59] alrightn, now to find the culprit. 
[11:28:34] 6Labs, 6operations: Network port saturated for labstore1001 - https://phabricator.wikimedia.org/T92614#1116297 (10yuvipanda) 3NEW [11:30:23] 6Labs, 6operations: Puppet failure on labstore1001 - https://phabricator.wikimedia.org/T92615#1116306 (10yuvipanda) 3NEW [11:56:35] YuviPanda: Hi [11:56:42] hi Vivek [11:56:59] YuviPanda: Is wikimedia labs looking for volunteers ? [11:57:08] indeed :D all the time :) [11:57:32] Can I volunteer ? [11:57:43] I can help out with OpenStack and Puppet. [11:57:51] YuviPanda: biggest user is 10.68.17.144, which has no reverse DNS [11:58:07] wait, it does, but only on labs-ns1 and not labs-ns0? [11:58:08] wth? [11:58:23] so labs-ns0 doesn't really work I guess [11:58:28] paravoid: so I’ve been poking around nethogs for a while. looks like it’s coming from the tools exec nodes but not all of them [11:58:39] paravoid: http://graphite.wmflabs.org/render/?width=586&height=308&_salt=1426246930.035&target=tools.tools-exec-*.network.eth0.rx_byte.value&target=tools.tools-exec-*.network.eth0.tx_byte.valuefrom=-3hours [11:58:46] rx and tx for the exec nodes [11:58:47] yes, 11 & 14 [11:58:52] YuviPanda: Do you still work for Wikimedia Foundation ? [11:58:56] Vivek: I do :) [11:59:00] Ok. [11:59:07] Vivek: I’m debugging an NFS saturation, can you give me 15-20 mins? [11:59:24] paravoid: it flips if you look long enough (used to be 07 and 08 when I started looking) [11:59:29] I’m trying to find the offending script [12:00:10] hcclab? [12:00:41] YuviPanda: I moved to Tech Mahindra as a Technical Architect/Project Manager. I am in Chennai now. Working with OpenStack and Ceph mostly. [12:01:01] paravoid: on which host? [12:01:05] tools.hcclab [12:01:18] 11 & 14 [12:01:22] holy shit [12:01:22] YuviPanda: Sure, please take your time. [12:01:22] tons of [12:01:23] 52455 29445 0.5 0.0 32164 1400 ? 
D 12:01 0:00 cat /data/project/hcclab/data/chunks/chunk-389.xml [12:01:25] yeah [12:01:30] let me kill them all [12:01:36] that being 500M [12:01:57] fwiw, iftop on labstore to find the network offender [12:02:02] then iotop on the VMs themselves [12:02:04] then ps [12:02:09] that's all I used, it took me ~5mins [12:03:03] paravoid: hmm, so I used nethogs, and then I looked at -07 and -08, but then there was nothing, and I went back and it had moved. [12:03:15] paravoid: so that person had submitted a *lot* of jobs, and they were being scheduled everywhere... [12:03:16] PROBLEM - Puppet failure on tools-exec-15 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0] [12:03:31] like a thousand or more jobs... [12:03:39] ouch [12:04:42] paravoid: I’ve killed them all [12:06:53] paravoid: I didn’t know about iftop, thanks :) I was installing and using nethogs... [12:07:14] paravoid: and network traffic back to normal [12:07:15] now [12:08:22] paravoid: also, where did you get the saturation warning? I didn’t get any, and I didn’t find any on icinga either [12:09:00] (and would like to get them) [12:10:09] 6Labs: Ganglia broken for labstore1001 (again) - https://phabricator.wikimedia.org/T92618#1116407 (10faidon) 3NEW a:3coren [12:11:13] 6Labs: Ganglia broken for labstore1001 (again) - https://phabricator.wikimedia.org/T92618#1116418 (10yuvipanda) (FWIW, I check graphite only, since that's what we have for labs machines, and graphite has been holding in data just fine) [12:11:55] YuviPanda: network-side alerts [12:12:16] i.e. the NMS alerted me that the switch has a saturated port [12:12:37] aaaah, right [12:12:49] I presume these don’t happen that often. [12:13:09] * YuviPanda wonders if he should be added to those [12:13:14] or alternatively, I can just setup an actual check [12:14:08] I have previously asked for a saturation-specific check for labstore1001, yes [12:14:13] it didn't happen [12:15:23] paravoid: I’ll do that later today. 
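[Editor's note: paravoid's triage sequence at [12:01:57]–[12:02:09] (iftop on the NFS server, iotop on the suspect VM, then ps) condenses to a few commands. The first two need root and are left as comments; the last step — spotting processes stuck in uninterruptible "D" state, like the hcclab `cat` shown above — is sketched as a helper whose awk column index assumes standard procps `ps aux` output:]

```shell
#!/bin/sh
# Step 1 (on labstore1001, as root): iftop -n       -> which client IP moves the bytes
# Step 2 (on that VM, as root):      iotop -obn 3   -> which processes do the I/O
# Step 3 (on that VM): list processes blocked in uninterruptible sleep,
# the classic signature of jobs stuck on a saturated NFS mount.
find_d_state() {
    # $8 is the STAT column in `ps aux`; a leading "D" means the process
    # is in uninterruptible (usually disk/NFS) wait.
    ps aux | awk 'NR > 1 && $8 ~ /^D/ {print $2, $11}'
}
find_d_state
```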
I have a feeling tools-db is also dead now. [12:16:36] hmm, seems ok [12:23:07] 6Labs, 6operations: Network port saturated for labstore1001 - https://phabricator.wikimedia.org/T92614#1116420 (10yuvipanda) 5Open>3Resolved a:3yuvipanda Turns out it was a tool that had started up a thousand or so jobs that all hit NFS, saturating everything. I've killed all the jobs, and will notify th... [12:23:11] RECOVERY - Puppet failure on tools-exec-15 is OK: OK: Less than 1.00% above the threshold [0.0] [12:31:04] PROBLEM - Puppet failure on tools-login is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [12:38:52] well, no tools-db outage, so that’s good [12:43:06] YuviPanda: I am leaving office now. [12:43:23] Will be home in around a hour and half. [12:43:37] Talk to you later today. Bye for now. [12:43:58] Vivek: cool! do ping me :) [12:46:00] PROBLEM - Puppet failure on tools-login is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [12:49:03] PROBLEM - Puppet failure on tools-webproxy-jessie is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [12:53:45] 6Labs, 10Wikimedia-Hackathon-2015: Organize Wikimedia Labs activities at the Wikimedia Hackathon 2015 - https://phabricator.wikimedia.org/T92274#1116447 (10coren) @yuvipanda I'd be happy if you would. I'm going to have my plate very full this hackathon as I've got about a half-dozen volunteers joining me for... [12:54:39] 6Labs, 10Wikimedia-Hackathon-2015: Organize Wikimedia Labs activities at the Wikimedia Hackathon 2015 - https://phabricator.wikimedia.org/T92274#1116449 (10yuvipanda) Cool [12:54:50] hey Coren [12:56:51] RECOVERY - Puppet staleness on tools-webgrid-tomcat is OK: OK: Less than 1.00% above the threshold [3600.0] [12:57:53] Still in email/phab catchup + coffee. What be up? [12:58:44] Coren: we had another labstore1001 saturation event. me and paravoid tracked down the tool and killed it for now. 
however, we also found out that there’s no ganglia data for it again, and also there’s a puppet failure. [12:59:54] YuviPanda: Why in blazes would it stop collecting data when nothing changed? [13:01:06] RECOVERY - Puppet failure on tools-login is OK: OK: Less than 1.00% above the threshold [0.0] [13:01:22] Ah, I see the puppet failure. It's a typo that was fixed in the patch I made right after the previous one. I thought you'd have sent that one in. Clearly unrelated to ganglia though [13:05:34] 6Labs, 10Tool-Labs, 10Wikimedia-Hackathon-2015: Organize Wikimedia Labs activities at the Wikimedia Hackathon 2015 - https://phabricator.wikimedia.org/T92274#1116471 (10Qgil) a:3yuvipanda [13:05:53] I'll merge it as soon as paravoid confirms I can also merge his pending [13:07:34] YuviPanda: Interesting. "/Stage[main]/Ganglia/Service[ganglia-monitor]/ensure: ensure changed 'stopped' to 'running'" [13:07:47] YuviPanda: What could have stopped it previously? [13:07:50] * Coren checks logs. [13:08:58] YuviPanda: Aha. ganglia doesn't actually *start*; it crashes immediately. [13:09:05] gmond* [13:10:29] * Coren digs in further. [13:19:56] YuviPanda: I see what happens but not why. There are sysctl in /etc/sysctl.d that are correctly in place but not actually applied on boot by the os. dafu? [13:20:26] (In our specific case, net.core.rmem_max from /etc/sysctl.d/60-wikimedia-base.conf) [13:25:51] 6Labs, 10Beta-Cluster, 10Staging, 5Patch-For-Review: Provide option to autosign puppet certs for self hosted puppetmasters - https://phabricator.wikimedia.org/T92606#1116500 (10scfc) [13:29:45] 6Labs, 6operations: bond0 connection on labstore1001 is unpuppetized - https://phabricator.wikimedia.org/T92622#1116529 (10yuvipanda) 3NEW [13:29:51] 6Labs: labstore1001: sysctl settings improperly(?)
applied on boot - https://phabricator.wikimedia.org/T92623#1116536 (10coren) 3NEW [13:32:32] 6Labs, 6operations: bond0 connection on labstore1001 is unpuppetized - https://phabricator.wikimedia.org/T92622#1116545 (10coren) Strictly speaking, it's commented out in site.pp atm because it is known to not work correctly because of a problem with the switch; the actual runtime configuration is not removed... [13:33:35] YuviPanda: We could remove the leftover configuration by hand from the box, but it's kinda pointless at this juncture. [13:49:19] I cannot log-in to wikitech.wikimedia.org :\ Anyone else facing this ? [13:49:57] 6Labs, 6operations: bond0 connection on labstore1001 is unpuppetized - https://phabricator.wikimedia.org/T92622#1116571 (10yuvipanda) Are you sure? eth0 and eth1 have no IP addresses on the machine, only bond0 does. [13:50:03] tonythomas: hey! [13:50:19] YuviPanda: hey ! [13:50:28] is it my issue - or something broke back there ? [13:50:28] tonythomas: try again now [13:50:40] okey. [13:51:05] yeah. done YuviPanda :) [13:51:13] tonythomas: \o/ cool [13:51:17] keystone was dead again [13:51:18] looks like something was broke :D [13:51:23] oh. [13:51:45] * tonythomas dont know what keystone is but considers it to be sth critical [13:51:53] restarted it, so all good now [13:52:02] right [13:52:38] I wanted to setup visualeditor in my local, but never got it working straight - some js alert coming up - so thought of using a labs instance [13:53:53] tonythomas: use mediawiki-vagrant! [13:54:08] tonythomas: and ask in #mediawiki-visualeditor. labs isn’t the right solution for this [13:54:39] YuviPanda: right. I have that one, let me see if its working [14:14:08] Coren: I’m unsure (re: bond0). I see neither eth0 nor eth1 have an IP [14:15:02] That's normal; you don't want the individual nics to have IPs. Right now, the best the switch could do was make a bonded single-port. :-) [14:15:40] right. either way, I think we should 1. document that and or 2. 
puppetize that. [14:17:20] YuviPanda: Puppetizing it, at this time, is kinda completely worthless. We really couldn't care less if the box reinstalled with the default of just eth0. I suppose we could just restore /etc/network/interfaces to the default of using eth0 now; but it won't change anything until a reboot and I fully expect that the next time labstore1001 is rebooted will be to reinstall it with Jessie. :-) [14:17:49] (Also I'd be worried if /etc/network/interfaces didn't match what ip addr told you. [14:18:35] I’ll admit to knowing only the vaguest things about bonded network interfaces :) [14:18:45] You can probably expound that on the bug and then close it... [15:18:01] andrewbogott: around? [15:18:09] YuviPanda: yep, what’s up? [15:18:14] andrewbogott: https://gerrit.wikimedia.org/r/#/c/196537/ [15:18:29] andrewbogott: I got rid of the —scriptuser to the autosigning script, so it uses the normal ldap user instead. [15:18:38] so self hosted puppetmasters can use it [15:18:45] do you know why we needed to use scriptuser? it isn’t doing any writes... [15:20:00] “FIXME: Above commit doesn't seem to reflect reality?” :) [15:20:13] 6Labs, 6operations, 7Monitoring: Setup alarms for labstore* to check for network saturatioin - https://phabricator.wikimedia.org/T92629#1116700 (10yuvipanda) 3NEW [15:20:50] YuviPanda: no idea, seems like it’s fine for it to run as root. [15:20:51] andrewbogott: it says ‘use a specific version for the checkout' [15:21:03] andrewbogott: and I see nothing about any specific version there... [15:21:06] YuviPanda: yeah, and then it doesn’t :) [15:21:16] yeah, hence the FIXME [15:21:21] maybe I should get rid of the comment instead [15:21:31] Oh, I think it’s referring to the ruby checkout above [15:21:44] oh? 
[15:21:46] mmmaybe [15:21:50] that at least specifies a version [15:22:09] andrewbogott: not really, it sets it to ‘latest' [15:22:21] hm [15:23:43] YuviPanda: —scriptuser is passed to the puppetsigner.py script yet I don’t see that referred to in that script. What am I missing? [15:24:40] andrewbogott: it gets passed all the way to ldapsupportlib [15:24:48] ah, I see. [15:24:49] hm [15:24:53] and it works without it? [15:38:10] YuviPanda: I can’t see a reason why —scriptuser is necessary, but after we merge let’s run a quick test on virt1000 to make sure [15:42:29] andrewbogott: sorry, was afk for a bit. [15:42:35] andrewbogott: yes, it does work when I tried. [15:42:43] andrewbogott: let’s merge and see if it craps out? :) [15:43:05] YuviPanda: do you mind removing that cryptic comment and cryptic metacomment? [15:43:13] hehe [15:43:15] no, let me do that now [15:52:05] YuviPanda: works fine for a new instance. [15:52:19] andrewbogott: \o/ wonderful. thank you [15:52:39] andrewbogott: I’m working on a small ENC now, hopefully will have a patch up in an hour or so... [15:52:48] useful only for self-hosted puppetmasters atm... [15:53:34] YuviPanda: fixing signing for self-hosted masters and the nfs-mounting race should make Timo happy. He was rebuilding a dozen ‘integration’ instances a couple of weeks ago and very sad about all the steps [15:54:04] andrewbogott: yeah. I ended up in the staging project and doing the same thing, and so am doing alll these things [15:55:49] Krenair: lots of goodies for people running self hosted puppetmasters. I’ll go document them all in a bit. 
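[Editor's note: as background for the ENC YuviPanda mentions at [15:52:39] — an external node classifier is simply a program Puppet invokes with the node's certname as its argument, expecting a YAML document of classes and parameters on stdout. A minimal sketch; the hostname pattern and role class name below are invented, and a real ENC for labs would presumably consult project/prefix configuration instead:]

```shell
#!/bin/sh
# Hypothetical minimal ENC. Puppet runs: enc.sh <certname>
# and reads the YAML classification from stdout.
enc() {
    case "$1" in
        tools-exec-*)
            # Invented role class for illustration.
            printf 'classes:\n  - role::toollabs::exec_node\nparameters: {}\n' ;;
        *)
            printf 'classes: []\nparameters: {}\n' ;;
    esac
}
enc "${1:-unknown-host}"
```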
[16:20:26] tools-trusty seems to have run out of space (for /tmp etc) [16:20:38] Filesystem 1K-blocks Used Available Use% Mounted on [16:20:38] /dev/vda1 9710768 9190948 3488 100% / [16:21:11] ah [16:21:12] also the link to open bugs in the topic still points to bugzilla [16:21:14] Coren: given an nfs mount like “labstore.svc.eqiad.wmflabs:/project/testlabs/home” is there a way to test whether or not I can access it w/out actually mounting something? [16:21:14] sitic: fixing. [16:21:22] thanks :-) [16:22:55] !log tools cleaned out / on tools-trusty [16:23:03] Logged the message, Master [16:24:31] andrewbogott: Well, lemme think. [16:24:49] google suggests rpcinfo but I’m not having much luck with that [16:25:08] btw, the above is wrong, it should be labstore.svc.eqiad.wmnet:/project/testlabs/home [16:26:54] andrewbogott: If you don't mind having to be root, showmount -e would work [16:27:35] root is ok [16:27:40] Hm. [16:27:43] *should* work. [16:27:53] # showmount -e [16:27:53] clnt_create: RPC: Program not registered [16:28:00] You need an ip. :-) [16:28:05] oh right [16:28:32] But also, the defaults won't go through the firewall. [16:28:41] * Coren tries to find a way around that. [16:29:08] this is for the firstboot script, I want it to sleep until the mount is available. [16:29:35] Ah. [16:29:51] will showmount error out if there are no exports? Or do I actually have to parse that output and see if my ip is in there? [16:30:27] andrewbogott: You'll have to parse the output, though the format is trivial enough [16:32:57] Also, from labs, it'd mean having to open a port or two. [16:34:02] Ah, no, we're already okay. Forgot that had to be open for nfs anyways. :-P [16:35:56] RECOVERY - Free space - all mounts on tools-trusty is OK: OK: All targets OK [16:37:49] Coren: so, maybe safe to just grep that whole blob for my IP, and if it’s in there we’re good? [16:39:42] of course, now I have to figure out my ip... 
[16:39:58] showmount -e labstore.svc.eqiad.wmnet | egrep ^/exp/project/$your_project\\s), | fgrep $your_ip, [16:40:07] Better yet, fgrep -q [16:40:14] This will be true iff the mount is available [16:42:29] um… egrep ^/exp/project/$your_project\\s) <- there’s a typo there someplace and I”m not sure where [16:43:35] ah, nm, think it’s working [16:56:23] andrewbogott: It can't be, I did a complete c-p fail. [16:56:46] I mean, it’s working after I removed the ) [16:56:47] echo $(showmount -e labstore.svc.eqiad.wmnet | egrep ^/exp/project/$your_project\\s), | fgrep $your_ip, [16:56:52] That is the correct one ^^ [16:57:17] the echo$( was missing at the beginning; that's where the ) comes from. :-) [16:57:41] what are those commas about? [16:58:17] This looks to be working: showmount -e $nfs_server | egrep ^/exp/project/$project\\s | fgrep -q $ip [16:58:45] The commas are to prevent '10.68.17.1' matching '10-.68.17.11' for instance [16:58:58] (without the stray dash) [16:59:16] hm, ok [16:59:40] Otherwise you'll get stray false positives about 4% of the time. [17:00:20] (Well, 100% of the time for about 4% of IPs) [17:06:33] Coren: https://gerrit.wikimedia.org/r/#/c/196233/ [17:07:52] andrewbogott: Why are you matching against 10.255.255.255? [17:08:07] ah, oops, because I was testing [17:08:12] Hah. :-) [17:10:56] It also beats a blind sleep 60. :-) [17:17:22] 6Labs: Setup a REST service that helps do NFS exports for OpenStack instances properly - https://phabricator.wikimedia.org/T92638#1116956 (10yuvipanda) 3NEW [17:19:16] 6Labs: Setup a REST service that helps do NFS exports for OpenStack instances properly - https://phabricator.wikimedia.org/T92638#1116963 (10coren) It may not need auth if all the service does is poke the check-ldap-for-new-stuff existing script rather than just have it wait in a loop. This way, the service can... 
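[Editor's note: the `showmount`-based check Coren and andrewbogott converge on at [16:56:47]–[16:58:58] can be wrapped up as a small function. The export-path prefix `/exp/project/` comes from the discussion; splitting the client list on whitespace/commas and requiring an exact match gives the same false-positive protection as the trailing-comma trick, so `10.68.17.1` can never match `10.68.17.11`:]

```shell
#!/bin/sh
# Reads a `showmount -e` listing on stdin and succeeds iff $2 is an
# authorized client of project $1's export.
export_ready() {
    project="$1"; ip="$2"
    line=$(grep -E "^/exp/project/$project[[:space:]]") || return 1
    # Client IPs follow the path, separated by whitespace or commas;
    # split them out and require an exact fixed-string match on the IP.
    echo "$line" | tr ' \t,' '\n\n\n' | grep -qFx "$ip"
}

# Intended use on an instance (needs RPC access to the NFS server):
#   showmount -e labstore.svc.eqiad.wmnet | export_ready testlabs "$my_ip"
```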
[17:22:15] 6Labs: Setup a REST service that helps do NFS exports for OpenStack instances properly - https://phabricator.wikimedia.org/T92638#1116977 (10yuvipanda) Whee, that does simplify it a fair bit. It does make it dependent on LDAP, however. [17:22:29] 6Labs: Setup a REST service that helps do NFS exports for OpenStack instances properly - https://phabricator.wikimedia.org/T92638#1116978 (10yuvipanda) p:5Triage>3Low [17:25:06] 6Labs: Setup a REST service that helps do NFS exports for OpenStack instances properly - https://phabricator.wikimedia.org/T92638#1116996 (10coren) As things currently stand, LDAP is the source of authority anyways - unless we replicate the data on labstore itself (undesirable). [17:49:26] 10Tool-Labs: Unattended upgrades are failing from time to time - https://phabricator.wikimedia.org/T92491#1117115 (10scfc) Yesterday @BBlack merged https://gerrit.wikimedia.org/r/#/c/196162/ which adds some locking against concurrent Puppet agent runs (?), so if these warnings would go away, there could be a cor... [18:03:26] labstore1001 saturated again [18:04:18] paravoid: Ima hunt down the culprit. Who was it last night? (Good candidate) [18:04:32] hcclab [18:05:26] What the. [18:08:29] Same one. Trying to get attention of maintainer while he's online. [19:00:19] YuviPanda: How configurable are the shinken alerts? Can we prioritize them visibly? [19:01:00] Coren: fairly configureable, sure. what are you thinking of doing? [19:01:05] there’s a dashboard at shinken.wmflabs.org too [19:01:07] (guest/guest) [19:01:59] I want to be able to configure levels of alerts for my mobile devices. I use pushover, and I can filter in order to make some nosier than others but I'd rather not hardcode a list of things-you-want-to-be-very-noisy about locally. 
[19:03:14] i dunno about shinken, but for icinga/nagios there are several android apps that let customize alerts [19:03:19] Right now, everthing is notices (and quiet during off-hours) but I can do "don't be quiet during off-hours" and "omg make every device of mine do loud noises until I acknowledge it wake me up now!" [19:03:43] (The latter I would like for NFS woes) [19:04:49] have everything mailed to you and then use mail filters with something like [19:04:52] http://www.appszoom.com/android_applications/tools/mail-alert_impo.html [19:05:14] "Alerts for new Emails based on filters like From / Subject values" [19:05:19] mutante: That's what I do now; but it'd be made 1000% easier if the email contained an "importance" field or somesuch. [19:06:03] Coren: hmm, so what *I* do is ignore the puppet failure emails :P [19:06:21] Coren: since those are flakey. Something like puppetmaster logrotating will cause them to fail [19:06:27] hmm. you mean a new header like spamassasing would add X-SpamScore or so? [19:06:29] Coren: and everything else pops up on my phone [19:06:40] Coren: those are easy enough to filter... [19:06:47] Coren: and I look at puppet failures when I’m on IRC. [19:06:47] mutante: That, or even just in the body. [19:06:50] and that’s good enough [19:07:02] YuviPanda: Yeah, I'll probably do that with just special cases for the labstores [19:07:05] and the volume of non puppet-failure related errors is low enough [19:07:13] that it’s worth putting them through [19:07:14] I think [19:07:42] Coren: labstore will have to be in icinga, though. And here’s the task for that :) https://phabricator.wikimedia.org/T92629 [19:08:31] YuviPanda: Right, bad example. The "wake-me-up" level for shinken would be the proxies. [19:08:39] Coren: right. [19:09:13] There's not enough of us to rely on /someone/ being aware of a big fail, especially during the weekend. [19:09:23] (Something ops in general can get away with) [19:09:47] Coren can you approve my labs-l post? 
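[Editor's note: the "importance field" idea mutante and Coren discuss above amounts to having the monitoring system stamp each alert mail with a severity header and filtering on it at delivery time. A sketch of the receiving side; the header name `X-Alert-Severity` is invented, and the case-insensitive `I` flag assumes GNU sed:]

```shell
#!/bin/sh
# Reads one mail message on stdin and prints the severity from a
# hypothetical X-Alert-Severity header ("none" if absent). A procmail or
# Sieve rule could branch on this to choose quiet vs. wake-me-up delivery.
alert_severity() {
    # Stop at the first blank line so body text cannot spoof a header.
    sev=$(sed -n 's/^X-Alert-Severity: *//Ip; /^$/q')
    echo "${sev:-none}"
}
```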
[19:09:49] I think the easiest way to do this is in the client. we could do it in shinken, if we want to, but just doing it in the client is flexible enough, I think. [19:10:21] Yeah, that's what I'll do. [19:13:02] how do I enable X11 forwarding for a labs puppet role? [19:13:24] I found @ssh_x11_forwarding which I can presumably set via hiera, but is there any way to that from a role? [19:14:33] tgr: It was meant to be set either via hiera or through the wikitech interface. [19:15:33] (configure instance -> ssh -> ssh_x11_forwarding set to 'yes') [19:16:17] I see, thanks [19:36:34] andrewbogott: beginnings of the ENC here: https://gerrit.wikimedia.org/r/#/c/196628/3 [19:47:01] 6Labs: Setup a REST service that helps do NFS exports for OpenStack instances properly - https://phabricator.wikimedia.org/T92638#1117598 (10scfc) This feels like a bridge too short. AFAIUI, `manage-nfs-volumes-daemon` serves two purposes: # Create new directories to be mounted for newly created projects. # Au... [19:50:06] 6Labs, 5Patch-For-Review: New disk partition scheme for labs instances - https://phabricator.wikimedia.org/T87003#1117605 (10thcipriani) [19:55:41] 6Labs: Setup a REST service that helps do NFS exports for OpenStack instances properly - https://phabricator.wikimedia.org/T92638#1117611 (10coren) Oh god no! The Right Way™ is to most certainly //not// have mountd have any external dependencies on being able to allow a mount. Ever. NIS isn't a rabbit hole, i... [20:05:26] hi, I requested access to Tool Labs yesterday, and I haven't heard back from an admin yet, but the help page suggested the wait is typically shorter. should I wait longer, or did I miss anything perhaps? [20:08:04] ggp: hey! [20:08:10] ggp: what’s your username on wikitech? [20:08:24] YuviPanda: Surlycyborg [20:11:15] ggp: done@ [20:11:19] welcome to toollabs :D [20:11:30] YuviPanda: great, thank you very much! :) [20:14:45] ggp: yw [20:41:57] wikibugs is AWOL from most channels again. 
Currently only in #wikimedia-dev and #wikimedia-labs (afaict) [20:42:14] hi quiddity [20:42:19] ;-) <3 [20:42:48] I touched it, lets see if that works [20:43:13] for the record, -feed works fine [20:43:51] -feed? 13:43 [@ChanServ] [ Eloquence] [ mutante] [20:43:52] whois only lists the channels I share with it. My irc-fu ends there... [20:44:04] #mediawiki-feed [20:44:17] oh, of course [20:44:20] the variety [20:44:21] i tried -feed for a while, but it didn't really help me with anything. [20:44:55] One Channel to Rule Them All, and In The Darkness Bind Them. [20:44:57] !log tools.wikibugs Updated channels.yaml to: 9eafe437ff005a3232e7f7e89dbb2be54437f76c tox: Rename channels env to standard py34 [20:45:01] Logged the message, Master [20:45:56] lol, 3 users there [21:09:12] krenair@fluorine:/a/mw-log$ tail eventlogging.log -n 3 [21:09:13] 2015-03-13 21:08:48 silver labswiki: wgEventLoggingFile has not been configured. [21:09:13] 2015-03-13 21:08:49 silver labswiki: wgEventLoggingBaseUri has not been configured. [21:09:13] 2015-03-13 21:08:49 silver labswiki: wgEventLoggingFile has not been configured. [21:09:13] krenair@fluorine:/a/mw-log$ wc -l eventlogging.log [21:09:14] 134354 eventlogging.log [21:09:15] krenair@fluorine:/a/mw-log$ grep -v labswiki eventlogging.log [21:09:17] krenair@fluorine:/a/mw-log$ [21:09:54] labswiki has filled the eventlogging log with 134354 errors about not being configured :| [22:27:17] still no wikibugs, in channels such as #mediawiki-visualeditor and #wikimedia-collaboration ( legoktm, you probably know, or are busy with higher priority things, but just in case. :) [22:27:34] gah [22:28:02] <3 [22:28:26] quiddity: I filed a bug for this, basically it's having issues joining channels but it doesn't know that it didn't join the channel. 
And the last time I tried putting in a hack to work around it, it started working again so I stopped investigating [22:29:02] !log tools.wikibugs restarted wb2-irc to see if it rejoins channels properly [22:29:06] Logged the message, Master [22:29:13] okie. lemme know if I can help (eg by periodically checking?) [22:29:59] right now I configured it to autojoin -feed, -dev, and -labs so if it screws up the impact should be less...I guess I could just start adding more channels to that autojoin [22:40:33] legoktm: -releng plz [22:40:55] :) [22:42:08] andrewbogott, hey, please see my messages above [22:43:40] Krenair: the labswiki config is managed by the standard deployment system, so it should be straightforward for you to submit a patch if you like. It would take me several hours before I knew where to start… [22:47:09] 6Labs: Get Labs openstack service dbs on a proper db server - https://phabricator.wikimedia.org/T92693#1118300 (10Andrew) 3NEW a:3Andrew [22:48:35] 6Labs, 6operations: Replicate or back up glance image data on virt1000 - https://phabricator.wikimedia.org/T90628#1118307 (10Andrew) [22:48:36] 6Labs: Get Labs openstack service dbs on a proper db server - https://phabricator.wikimedia.org/T92693#1118300 (10Andrew) [22:54:01] 6Labs: db servers for designate and labs pdns - https://phabricator.wikimedia.org/T92694#1118318 (10Andrew) 3NEW a:3Andrew [22:55:33] 6Labs: db servers for designate and labs pdns - https://phabricator.wikimedia.org/T92694#1118318 (10Andrew) Note that both virt1000 and holmium (soon to be the designate server) are in a public vlan, not in labs-private. The stuff in labs-private talks to them via rabbitmq and db calls are marshalled through th... [22:55:59] Krenair: or you could start by filing a bug :) [23:37:47] someone can drop my ssh key into a bastion host? :) [23:42:02] einyx: :) https://wikitech.wikimedia.org/wiki/Help:Contents#Requesting_Shell_Access