[00:02:30] Labs, Tool-Labs, Collaboration-Team-Backlog, Flow, wikitech.wikimedia.org: Enable Flow on wikitech (labswiki and labtestwiki), then turn on for Tool talk namespace - https://phabricator.wikimedia.org/T127792#2078239 (bd808) @Krenair would storing in the main db make things easier or harder for...
[00:16:04] RECOVERY - Puppet failure on tools-webgrid-lighttpd-1201 is OK: OK: Less than 1.00% above the threshold [0.0]
[00:17:51] Labs, Labs-Infrastructure, Operations: Estimate hardware requirements for relevance lab elasticsearch servers - https://phabricator.wikimedia.org/T128433#2078294 (EBernhardson) Sounds reasonable. I will put the ask for 2x nobelium level hardware in the strategic goals portion of discovery budget with a...
[01:12:24] Labs, Tool-Labs, Collaboration-Team-Backlog, Flow, wikitech.wikimedia.org: Enable Flow on wikitech (labswiki and labtestwiki), then turn on for Tool talk namespace - https://phabricator.wikimedia.org/T127792#2078480 (Krenair) Hm, wikitech-static, good point. I don't think it matters as long as...
[01:16:44] Labs, Tool-Labs, Collaboration-Team-Backlog, Flow, wikitech.wikimedia.org: Enable Flow on wikitech (labswiki and labtestwiki), then turn on for Tool talk namespace - https://phabricator.wikimedia.org/T127792#2078483 (Mattflaschen) We have our own dump script (Flow/maintenance/dumpBackup.php)....
[01:35:26] Labs, Tool-Labs, Operations, Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2078533 (Dzahn) works now:) root@tools-proxy-01:~# tail -f /var/log/nginx/access-scheme.log shows first results
[01:40:22] Labs, Tool-Labs, Operations, Traffic, and 2 others: Detect tools.wmflabs.org tools which are HTTP-only - https://phabricator.wikimedia.org/T128409#2078538 (Dzahn) here's a first list of tools using http, status 200 ``` add-information admin anagrimes anomiebot anomiebot HTTP ~apper apple-touch-ic...
[02:08:05] Tool-Labs-tools-Other, Community-Tech, Community-Wishlist-Survey, Milestone: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#2078609 (Bianjiang) I guess @ezachte has answered my question, or maybe there are other options i was not aware of ... my use case: I build logic to extract v...
[09:35:25] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:40:25] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 797565 bytes in 8.165 second response time
[09:46:28] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:56:23] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 797587 bytes in 6.416 second response time
[10:43:05] 503 Service Temporarily Unavailable?
[10:43:11] hm, small hiccup
[11:14:43] andrewbogott: there's a disk space warning for labvirt1008 in icinga
[12:32:59] Labs, MediaWiki-extensions-OpenStackManager, Patch-For-Review: Additions and removals of project members are very hard to decipher in the wiki diff - https://phabricator.wikimedia.org/T128001#2079598 (Krenair) a:Krenair
[12:50:02] Tool-Labs-tools-Other: Zoomviewer disfunctionally laggy for very large images - https://phabricator.wikimedia.org/T128580#2079661 (Aklapper) Unrelated to #Commons hence removing that tag from this task. Adding #Tool-Labs-Tools-other instead. http://tools.wmflabs.org/ lists @Dschwen as maintainer, hence addin...
[13:33:45] Tool-Labs-tools-Other: Zoomviewer disfunctionally laggy for very large images - https://phabricator.wikimedia.org/T128580#2079793 (dschwen) Will take a look. Thanks for the bug report.
[14:46:10] Tool-Labs-tools-Other: Zoomviewer disfunctionally laggy for very large images - https://phabricator.wikimedia.org/T128580#2079919 (Fae) >>! In T128580#2079661, @Aklapper wrote: > Unrelated to #Commons hence removing that tag from this task. Adding #Tool-Labs-Tools-other instead. I'm unsure why this is unrel...
[14:48:56] andrewbogott: Remember the rezabot stuff last week? Looks like the tool has two more jobs being problematic now: 3758653 "lighttpd-rezabot" on tools-webgrid-lighttpd-1210 and 3956069 "porbinandeh" on tools-exec-1221.
[14:49:39] anomie: does that mean they are not working as intended or they are doing harmful things hurting other jobs?
[14:50:04] chasemp: They're a large part of the reason the red line on https://grafana.wikimedia.org/dashboard/db/authentication-metrics?panelId=13&fullscreen is very high.
[14:50:45] It's broken code trying and failing to log in several times per second, and the operator has been pinged on wikis and doesn't seem to care.
[14:51:37] So finally we just had the jobs killed, and relevant crontab entries disabled.
[14:51:51] some mystery mechanism started it?
[14:53:20] It's presumably not handling cookies correctly so it broke when SessionManager was deployed. It's amazing what sorts of fragile hacks people will use because it happens to work at the time they're writing it, when there are libraries and such that will do things correctly for them.
[14:54:24] Chances are if the operator would just upgrade pywikibot it would start working again...
[14:54:37] anomie: is the thinking then to pull the plug on those jobs?
[14:54:48] and do we know why they were (re)started?
[14:55:23] chasemp: Yeah. They're stuck in a loop trying to log in, so they're pretty clearly not doing anything *useful*. Just wasting CPU cycles and flooding our logs, like that graph.
[14:55:43] These two weren't acting up last week, so they weren't killed then.
[14:56:29] !log tools qdel 3956069 and 3758653 for abusing auth
[14:56:30] done
[14:56:32] I don't know whether they're monthly cronjobs, or just things that were still logged in last week but the session expired this week.
[14:56:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL, Master
[14:56:50] chasemp: Thanks! Check the crontab too to make sure they're not restarted?
[14:56:58] I'll look about a bit to see if I can figure out why they started after lunch, I'm wrapped up in something atm
[14:57:00] sure
[14:59:17] Labs, Dumps-Generation, Operations, wikitech.wikimedia.org: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#2079927 (Dzahn) a:Dzahn
[15:01:58] chasemp: they are most likely started by crons; the others were
[15:06:40] ok andrewbogott thanks, hey did you see the wrn about disk on labvirt1008
[15:06:53] There are two other bots with this problem currently :( One I notified last Thursday, so if the operator doesn't fix it by tomorrow (which seems unlikely) I'll probably ping for that one to get the same treatment. The other I just notified now.
[15:06:58] chasemp: I did but I haven’t done anything about it yet
[15:07:28] anomie: could you possibly make a ticket describing teh problem and we can link all offenders in and note we have pinged them etc?
[15:07:36] as teh list grows we'll lose track on our end for sure
[15:08:37] chasemp: T124252 describes the problem, but if you want one that's more specific to "Broken bots on Tool Labs" than "needtoken error spike" I can do that.
[15:08:37] T124252: NEED_TOKEN error spike when 1.27-wmf.11 SessionManager was deployed to group1 - https://phabricator.wikimedia.org/T124252
[15:09:54] chase, did you sign yourself up for shinken yet? the tools home alert fired twice more overnight
[15:10:17] Probably spikes in activity on the part of other tools on the same host
[15:11:01] anomie: that's good let's just try to toss a note in as we track them down and disable I guess
[15:11:12] andrewbogott: no I haven't, any thoughts on why that fails but the main prod one doesn't?
[15:11:39] I don’t know, although I think the main one maybe doesn’t fire until there are two failures, or something like that?
[15:11:45] chasemp: Ok, I'll add a comment to that bug about which bots have been handled so far.
[15:12:53] andrewbogott: ok I feel like I don't know nearly enough about how that check works in both cases
[15:13:02] ok
[15:13:02] I'll try to check it out
[15:13:04] andrewbogott: thanks
[15:13:19] maybe we should chat about this when yuvipanda shows up. Right now I’m just ignoring those alerts, and I suspect he is too
[15:14:31] I'm unsure on teh practicality of shinken paging atm and esp duplicate checks in different systems with different params
[15:15:02] I'm not saying I"m against it just yeah it's unclear what the strategy is there and I don't want to fatigue us all for nothing that's for sure
[15:18:08] andrewbogott: let's try to rope in yuvi and do like 30m convo or so? + just general catchup maybe tomrorow idk
[15:18:24] chasemp: yep, today or tomorrow is fine
[15:19:01] Labs, Dumps-Generation, Operations, wikitech.wikimedia.org, Patch-For-Review: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#2079954 (Dzahn) p:Lowest>Low
[15:19:38] Labs, Tool-Labs-tools-Other, Developer-Relations: Create an authoritative and well promoted catalog of Wikimedia tools - https://phabricator.wikimedia.org/T115650#2079961 (Qgil) FYI, there is a proposal to include this epic task as part of the Technical Collaboration team annual plan for FY2016-17 -- s...
[16:01:46] Tool-Labs-tools-Other, Community-Tech, Community-Wishlist-Survey, Milestone: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#2080083 (Nuria) >So I need to have pageview for all articles. You can get that from pageview API, please see docs: https://wikitech.wikimedia.org/wiki/Analytic...
[16:11:28] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:16:25] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 797707 bytes in 6.334 second response time
[16:18:34] Is there a project for tool-labs tools without own projects?
[16:18:50] Luke081515: tools-projects-other or something like that
[16:20:18] valhallasw`cloud. I just found Labs-Other-Projects
[16:20:33] Tool-Labs-tools-Other
[16:21:00] ah, thanks
[17:01:26] having an odd issue i can't explain with the vagrant lxc box ... i'm scripting up something that runs maint scripts inside vagrant from my local machine, so the cmd looks something like: cat list_of_queries | ssh searchdemo.eqiad.wmflabs 'cd /srv/mediawiki-vagrant && vagrant ssh -- mwscript extensions/CirrusSearch/maintenance.runSearch.php --baseName enwiki'
[17:01:44] this errors out complaining i need to use setup.sh, but the box is already setup and running
[17:01:58] i tried wrapping in bash -lc "..." to simulate a login shell in case there was some missing env, but no luck
[17:02:31] ebernhardson: change "vagrant" to "/usr/local/bin/mwvagrant" and see if that works
[17:03:02] bd808: doh, that works :)
[17:03:04] there is a profile.d script that aliases "vagrant" to "/usr/local/bin/mwvagrant"
[17:03:04] bd808: thanks!
[17:04:04] ebernhardson: yw
[17:04:18] * bd808 disavows writing that whole mess last summer
[17:10:49] subbu: I would like to migrate the instance ‘parsoid-spof’ to another virt host. That will result in an hour or two of downtime. Will that disrupt anyone?
[17:11:02] Krenair: same question ^
[17:11:18] I have no idea what happens in that instance
[17:11:29] I don't think I have anything connecting to it
[17:11:32] Krenair: ok — any thoughts about who else to ask, besides subbu?
[17:11:46] parsoid developers
[17:11:52] other than that, not really
[17:11:56] so that would be subbu :)
[17:11:58] ok, thanks
[17:12:02] andrewbogott, no disruption that i know of.
[17:12:12] subbu: anyone else I should check with before I turn it off?
[17:12:29] subbu: or, better yet, is it obsolete? Deleting is easier than migrating :)
[17:12:34] one sec ... let me look at the instance.
[17:12:46] (it’s huge, hence in the crosshairs)
[17:14:34] i don't remember creating it .. i am in a hangout right now .. can i get back in 30 mins?
[17:14:49] subbu: yep!
[17:14:50] thanks
[17:15:05] subbu: I don’t know who created it, it’s in the ‘visualeditor’ project
[17:16:19] andrewbogott, check with g.wicke / roan
[17:16:45] it is on from 2014 ... so i suspect one of them created it.
[17:16:59] and yes, better to delete something that is not being used.
[17:22:32] andrewbogott: chasemp my suspicion for the shinken failure is that internal DNS is failing (since icinga isn't failing)
[17:22:43] andrewbogott: chasemp +1 for getting rid of it from shinken because we have it in icinga
[17:23:15] internal dns? Would I get a flood of ‘security alert unable to resolve…’ in that case?
[17:23:27] depends on how and when it fails :)
[17:23:33] maybe it's the machine itself overloading and failing
[17:27:31] PROBLEM - SSH on tools-webgrid-lighttpd-1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:29:10] Labs, Tool-Labs, Operations: Get rid of Tool Labs home page check from shinken - https://phabricator.wikimedia.org/T128615#2080500 (yuvipanda)
[17:32:00] chasemp: andrewbogott ^ filed it
[17:32:10] k
[17:32:23] RECOVERY - SSH on tools-webgrid-lighttpd-1204 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2~wmfprecise2 (protocol 2.0)
[17:32:33] * yuvipanda goes back to conferencing
[17:34:57] Tool-Labs-tools-Other, Community-Tech, Community-Wishlist-Survey, Milestone: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#2080515 (Milimetric) >>! In T120497#2078609, @Bianjiang wrote: > my use case: > I build logic to extract various of things from an article, and the logic may f...
[17:39:41] Tool-Labs-tools-Other, Community-Tech, Community-Wishlist-Survey, Milestone: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#2080530 (ezachte) @Nuria depends on what @Biangjang meant: I thought separate counts for each article. That might work in theory, not in practice, for largest...
[17:44:01] Tool-Labs-tools-Other, Community-Tech, Community-Wishlist-Survey, Milestone: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#2080550 (Milimetric) @ezachte oooh!!! I thought you did that but then I looked at the docs on the page and they still say they're a derivative of pagecounts-ra...
[17:48:41] on wikitech, there is a /a/backup/public with dumps, but even though it's called public, it actually is "Forbidden" (https://wikitech.wikimedia.org/dumps) because the Apache config says "Require host wikitech-static.wikimedia.org". Is there any reason to do that? Can we simply make it public?
[17:49:05] this question re: https://phabricator.wikimedia.org/T54170
[17:50:02] gwicke, andrewbogott was asking earlier about parsoid-spof vm in the visualeditor project.
[17:50:08] Tool-Labs-tools-nlwikibots: Archiving nomination notices send a ping - https://phabricator.wikimedia.org/T128617#2080577 (Akoopal)
[17:50:16] do you know what it is being used for. I don't remember creating it .. neither does Roan.
[17:50:40] that one has been online since 2014.
[17:50:49] so i figured you must have created it.
[17:50:54] subbu: gwicke created it
[17:51:05] (I think?)
[17:51:13] I remember talking to him about it a long time ago about it
[17:52:10] Labs, Dumps-Generation, Operations, wikitech.wikimedia.org, Patch-For-Review: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#2080589 (Dzahn) >>! In T54170#1410096, @ArielGlenn wrote: > should we host the lastest dump on dumps.wm.org? We can open up the dumps...
[17:52:26] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:52:49] hmm
[17:55:46] Labs, Dumps-Generation, Operations, wikitech.wikimedia.org, Patch-For-Review: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#2080618 (Dzahn) @Andrew @yuvipanda any thoughts on the restriction on that /a/backup/public ? please see https://gerrit.wikimedia.org...
[17:55:52] Tool-Labs-tools-Other, Community-Tech, Community-Wishlist-Survey, Milestone: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#2080620 (ezachte) I was ging to update those docs, then I forgot. My bad.
[17:57:21] RECOVERY - ToolLabs Home Page on toollabs is OK: HTTP OK: HTTP/1.1 200 OK - 797635 bytes in 5.718 second response time
[18:02:11] RECOVERY - Puppet failure on tools-webgrid-generic-1401 is OK: OK: Less than 1.00% above the threshold [0.0]
[18:05:41] subbu, yuvipanda: afaik the only remaining use was the parsoid repo, but IIRC we moved that to releases.wikimedia.org
[18:06:05] yeah: https://www.mediawiki.org/wiki/Parsoid/Setup#Ubuntu_.2F_Debian
[18:06:23] no reason to keep the VM from my side
[18:06:29] andrewbogott, ^
[18:07:10] gwicke: I will migrate it, turn it off, and then we’ll wait a few days to see if anyone complains
[18:07:43] it's a spof, so failure is not unexpected
[18:08:04] thanks for cleaning up after us!
[18:08:05] failures… that matter? Or that don’t?
[18:08:15] Actually, is that whole project maybe defunct?
[18:08:34] it was a joke playing on the instance name
[18:08:51] gwicke: ah, ok :)
[18:09:14] I'm sure there are some old installs referencing this still, but they are better served by getting errors & then fixing it than simply not getting any updates any more
[18:09:31] Labs: Maybe delete instance parsoid-spof in 'visualeditor' project - https://phabricator.wikimedia.org/T128620#2080663 (Andrew)
[18:11:08] !log visualeditor migraiting parsoid-spof to labvirt1010 after which it will be left in shutdown state
[18:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Visualeditor/SAL, dummy
[18:21:50] is there a good pattern in ops/puppet (labs) for authorizing a project's admins ssh keys for a particular user?
[18:32:57] marxarelli: there is magic for shared key access to the root account somewhere in puppet/hiera
[18:33:27] marxarelli: grep around for passwords::root::extra_keys
[18:35:07] bd808: hm, not finding anything
[18:36:11] marxarelli: hmm you're right... where does this hide
[18:36:57] bd808: also, i'm looking to do this for a deploy user, not root
[18:37:25] marxarelli: ah. it's in labs/private -- modules/passwords/manifests/init.pp & modules/passwords/templates/root-authorized-keys.erb
[18:37:27] one idea i had was to define a `ssh::userkey` resource that joins an array of keys defined in hiera
[18:37:47] and then just overwrite that using the Hiera ns on wikitech
[18:38:44] marxarelli: shouldn't it work like scap does and just grant sudo to a deploy account with a single key managed by keyholder?
[18:39:12] bd808: later maybe. this is an existing Rails app that uses capistrano for deploys :)
[18:39:17] and we
[18:39:25] we're working on it as an ad hoc team
[18:39:36] so, not a ton of time to refactor the deployment
[18:39:48] there was a bunch of gross stuff I had to do for the ssh access of mwdeploy in beta cluster
[18:39:56] I don't know if that's gone now or not
[18:40:31] marxarelli: can't capistrano have a sudo step?
[18:41:00] so that the ssh connections are done as deployer X but the scripts usdo to shared user Y on the target host?
[18:41:10] *sudo
[18:42:09] bd808: i _think_ sudo stuff is explicit in capistrano, but i'll look into it more
[18:42:49] i.e. i'm not sure i can refactor the cap configs to execute _everything_ via sudo
[18:43:05] it's been a while
[18:44:19] win 2
[18:46:40] yuvipanda: yeah the shinken check fails far more often, and I honestly don't think for the last week or so that it would be nfs, maybe? but somethinga bout the check is more prone
[18:46:55] dns does seem like a candidate
[18:52:27] Labs, Labs-Infrastructure, Operations: Estimate hardware requirements for relevance lab elasticsearch servers - https://phabricator.wikimedia.org/T128433#2080779 (EBernhardson)
[18:52:44] Labs, Labs-Infrastructure, Operations: Estimate hardware requirements for relevance lab elasticsearch servers - https://phabricator.wikimedia.org/T128433#2074588 (EBernhardson)
[19:12:24] chasemp: could also be load on the machine
[19:40:44] Labs, Labs-Sprint-104: Recover files from old corrupted file system (Tracking) - https://phabricator.wikimedia.org/T104334#2080999 (chasemp) Open>Resolved we can open it if something further came up
[19:54:18] Labs, Labs-Infrastructure, Tool-Labs, Monitoring, Operations: Ensure mysql credential creation for tools users is running - https://phabricator.wikimedia.org/T125874#2081132 (chasemp) Open>Resolved It's been ok for awhile now, we can reopen if it starts breaking again :)
[19:56:38] Labs, Labs-Infrastructure, Patch-For-Review: LDAP is not working - https://phabricator.wikimedia.org/T122757#2081136 (chasemp) Open>Resolved Let's revisit if this starts happening
[19:57:04] Labs, Labs-Infrastructure: intermittent DNS resolve problems with wmflabs domains - https://phabricator.wikimedia.org/T65709#2081140 (chasemp)
[19:57:06] Labs, Labs-Infrastructure: Start pdns after opendj - https://phabricator.wikimedia.org/T65717#2081138 (chasemp) Open>Resolved there is no opendj now :)
[20:11:37] Labs, Labs-Infrastructure, Upstream: Allow login using mosh as an alternative to plain ssh on bastion - https://phabricator.wikimedia.org/T54693#2081203 (chasemp) Open>Invalid I am tossing this for now since mosh doesn't support the methodology used by labs in general. What can be done was done...
[20:13:07] Labs, Labs-Infrastructure: Switch DNS middleware to Moniker - https://phabricator.wikimedia.org/T48818#2081209 (chasemp) Open>Invalid this is wildly different now :)
[20:13:09] Labs, Labs-Infrastructure, Tool-Labs-tools-meetbot: restore Meetbot logs from around 2015-06 lost in NFS outage - https://phabricator.wikimedia.org/T113000#2081212 (yuvipanda) a:yuvipanda I don't think those files still exist - I'll take a look this week and close as declined / actually recover the...
[20:14:44] Labs, Labs-Infrastructure: Switch DNS middleware to Moniker - https://phabricator.wikimedia.org/T48818#2081216 (chasemp)
[20:14:45] Labs, Labs-Infrastructure: Write a Moniker backend for gdnsd - https://phabricator.wikimedia.org/T48825#2081215 (chasemp) Open>Invalid
[20:14:57] Labs: Maybe delete instance parsoid-spof in 'visualeditor' project - https://phabricator.wikimedia.org/T128620#2081217 (Andrew) Update: I didn't migrate it, I just shut it down in place. With luck I can delete it in a few days.
[20:17:30] Labs, Beta-Cluster-Infrastructure: Completely remove Beta Cluster dependency on NFS - https://phabricator.wikimedia.org/T102953#2081227 (yuvipanda)
[20:17:32] Labs, Beta-Cluster-Infrastructure, Patch-For-Review: Disable /data/project for instances in deployment-prep that do not need it - https://phabricator.wikimedia.org/T125624#2081225 (yuvipanda) Resolved>Open Bah, since I didn't delete them from `/etc/fstab` they would be back when restarted, which...
[20:19:15] chasemp, andrewbogott: The "lighttpd-rezabot" job came back.
[20:20:02] making small change to wikitech apache config in puppet, not expecting any change though
[20:20:10] anomie: that’s the one that chase killed today?
[20:20:13] Or one I killed a while ago?
[20:20:27] andrewbogott: That's one of the ones chase killed today.
[20:20:31] we will just have 1 Apache config for all openstack versions now
[20:20:37] anomie: ok, he probably didn’t purge the crontab
[20:21:21] andrewbogott: Or try "webservice stop" as the tool, I forget exactly how the lighttpd jobs work.
[20:21:26] ok, done, the MD5 changed but that was only the fixed comment line
[20:21:36] puppet run on silver done
[20:29:19] anomie: I can’t pay attention to this right now, maybe make a phab task if chase isn’t able to follow up
[20:30:34] I killed that job and am looking at the cron entries now andrewbogott anomie
[20:31:35] chasemp: Try su-ing to the tool and do "webservice stop"? It came back already.
[20:31:57] Labs, Dumps-Generation, Operations, wikitech.wikimedia.org, Patch-For-Review: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#2081298 (Dzahn) >>! In T54170#1362915, @Krenair wrote: > and also, wikitech's /dumps directory is protected by `Require host wikitech-...
[20:32:12] done
[20:32:40] Thanks, that seems to have done it.
[20:37:42] Labs, Dumps-Generation, Operations, wikitech.wikimedia.org, Patch-For-Review: Provide dumps of wikitech.wikimedia.org - https://phabricator.wikimedia.org/T54170#2081323 (Krenair) Was added in https://gerrit.wikimedia.org/r/#/c/189889/ Should've been caught in review
[20:46:16] anomie: let me know if it comes back now
[20:47:09] Labs, Labs-Infrastructure, Dumps-Generation: Create -latest alias for dumps - https://phabricator.wikimedia.org/T47646#2081374 (chasemp)
[20:49:36] Labs: Maybe delete instance parsoid-spof in 'visualeditor' project - https://phabricator.wikimedia.org/T128620#2080663 (Jdforrester-WMF) @Andrew I believe this is used by @cscott; might be worth waiting for him to be back before making any decisions.
[20:50:25] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:50:46] Labs: role::simplelamp fails to start mysql due to apparmor - https://phabricator.wikimedia.org/T128642#2081399 (EBernhardson)
[20:54:30] Labs: role::simplelamp fails to start mysql due to apparmor - https://phabricator.wikimedia.org/T128642#2081461 (EBernhardson) I was able to workaround by doing the following: ``` sudo ln -s /etc/apparmor.d/usr.sbin.mysqld /etc/apparmor.d/disable/ sudo service apparmor reload sudo puppet agent -tv ```
[20:55:16] * ebernhardson hangs head in shame at lazy way of fixing things
[21:01:19] Labs, Labs-Infrastructure, Tool-Labs-tools-meetbot: restore Meetbot logs from around 2015-06 lost in NFS outage - https://phabricator.wikimedia.org/T113000#1651914 (scfc) In the task description @Spage also suggested that the minutes could be recreated by feeding the IRC logs of `wm-bot` to `meetbot`....
[21:25:56] 10Tool-Labs-tools-nlwikibots: Archiving nomination notices send a ping - https://phabricator.wikimedia.org/T128617#2081651 (10Akoopal) roughly 3 options that can be quick: - Don't let the bot sign the nomination message - Change the bot signature to something non-standard so it doesn't trigger a ping. - Have th... [21:26:51] 6Labs, 10Labs-Infrastructure, 6Operations: labservices1001 ran out of disk space - https://phabricator.wikimedia.org/T126572#2081667 (10chasemp) 5Open>3Resolved I spoke with andrew and this was always rotated, the problem came from an increased debug level of logging for designate to troubleshoot the rec... [21:36:31] 10Tool-Labs-tools-nlwikibots, 6Collaboration-Team-Backlog, 10Notifications: Archiving nomination notices send a ping - https://phabricator.wikimedia.org/T128617#2081740 (10Mattflaschen) [21:41:30] mutante: it seems i messed up the hieradata in https://gerrit.wikimedia.org/r/#/c/272906/ as well [21:42:13] which i'm guessing is on account of there being no role hierarchy in labs? [21:44:09] getting "Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Must pass dependencies to Class[Programdashboard::App] at /etc/puppet/modules/role/manifests/programdashboard/app.pp:7" [21:44:12] mutante: any ideas? [21:44:21] marxarelli: yeah, the role keyword magic doesn't work in Labs [21:45:08] bd808: what's the best place to put the programdashboard hiera stuff? [21:45:17] marxarelli: for the labs hosts you can do something like this -- https://github.com/wikimedia/operations-puppet/blob/production/hieradata/labs/stashbot/common.yaml [21:45:30] i tried labs/programdashboard.yaml as well (based on what i saw in the nuyaml backend) [21:45:32] or you can stuff it all in the Hiera: wikitech page [21:45:37] eh, separately would it be "role/common/programdashboard/app.yaml ? [21:46:02] there is also a special wiki page in labs [21:46:07] where you can add hieradata [21:46:13] oy. 
i'd like to avoid Hiera: in this case, except for overrides [21:46:50] marxarelli: hieradata/labs/$your_project/common.yaml [21:54:36] bd808: bad idea to just use hieradata/labs/common.yaml? otherwise, the role is sort of useless anywhere else but in our project [21:56:49] bd808: oh, nm. there is no labs/common.yaml :) [21:57:46] marxarelli: the "right" fix would be to reengineer all role application in Labs to make it compatible with the keyword magic that is used in prod ;) [21:58:24] * marxarelli pretends he didn't read that [22:00:13] 10Tool-Labs-tools-nlwikibots, 6Collaboration-Team-Backlog, 10Notifications: Archiving nomination notices send a ping - https://phabricator.wikimedia.org/T128617#2080577 (10Mattflaschen) Possible solution: On-wiki blacklist per https://www.mediawiki.org/wiki/Extension:Echo#Usage . That will block all Echo n... [22:02:55] marxarelli: the other more general option is not to use role/* in prod either and instead put your config in hieradata/common/role/* [22:03:06] that config is inherited by Labs instances [22:03:35] or common/$module/* [22:03:38] bd808: do you mean hieradata/common/*? [22:03:40] ah [22:03:51] ok, i thought i tried that [22:03:54] i'll try again [22:05:05] bd808: for a param named $foo::bar::baz the nuyaml backend should look in hieradata/common/foo/bar.yaml, right? [22:05:22] yes [22:05:35] and the variable in bar.yaml should be `baz: whatev`? [22:05:49] yup [22:05:58] okey doke. i'll give that a shot [22:05:59] thanks! [22:06:25] yw [22:19:06] bd808: damn. not working for me. 
i have a `hieradata/common/programdashboard/app.yaml` and still get "Must pass dependencies to Class[Programdashboard::App] at /etc/puppet/modules/role/manifests/programdashboard/app.pp:7" [22:25:29] marxarelli: hmmm [22:29:10] bd808: http://hastebin.com/oluhijuteb.coffee [22:29:48] the "Searching for programdashboard::app::directory in" lines look suspicious [22:40:25] bd808: oh, it's because there's no expand_path for nuyaml in the labs hiera.yaml [22:40:36] so it won't try any fancy lookup [22:41:57] i'll submit a patch for expansion in `common` at least [23:06:55] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 6Discovery, and 3 others: labstore monitoring - "Last run result for unit .. was exit-code" - https://phabricator.wikimedia.org/T128526#2082170 (10chasemp) [23:06:57] 6Labs, 6Operations: revise replicate backup jobs - https://phabricator.wikimedia.org/T127567#2082171 (10chasemp) [23:07:44] (03PS1) 10Matanya: adding election.php [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/274573 [23:10:33] (03CR) 10Matanya: [C: 032 V: 032] adding election.php [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/274573 (owner: 10Matanya) [23:14:36] (03PS1) 10Matanya: add stewardbots doc [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/274574 [23:15:37] (03CR) 10Matanya: [C: 032 V: 032] add stewardbots doc [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/274574 (owner: 10Matanya) [23:16:04] 6Labs, 6Operations: revise replicate backup jobs - https://phabricator.wikimedia.org/T127567#2082209 (10chasemp) Backups have been failing again and I had a few moments to look into things (I am also merging in a task daniel made -- thanks daniel -- as however we address this needs to be systemic). This is fa... [23:18:22] (03PS3) 10MarcoAurelio: Updating HTML for main ~stewardbots page. 
[labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/273772 [23:18:26] (03PS1) 10Matanya: move index.html to dic root [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/274575 [23:18:48] (03CR) 10Matanya: [C: 032 V: 032] move index.html to dic root [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/274575 (owner: 10Matanya) [23:18:58] 6Labs, 6Operations: revise replicate backup jobs - https://phabricator.wikimedia.org/T127567#2082213 (10chasemp) [23:21:26] 6Labs: tools replication is failing between labstore1001 and labstore2001 - https://phabricator.wikimedia.org/T124310#2082232 (10chasemp) [23:21:39] 6Labs: tools replication is failing between labstore1001 and labstore2001 - https://phabricator.wikimedia.org/T124310#1952556 (10chasemp) [23:21:41] 6Labs, 6Operations: revise replicate backup jobs - https://phabricator.wikimedia.org/T127567#2082234 (10chasemp) [23:22:17] 6Labs, 10Labs-Infrastructure, 10Tool-Labs, 6Operations: labstore - replication to codfw broken or not working yet - https://phabricator.wikimedia.org/T125749#2082235 (10chasemp) [23:22:57] 6Labs, 6Operations: revise replicate backup jobs - https://phabricator.wikimedia.org/T127567#2046998 (10chasemp) [23:23:52] 6Labs, 6Operations: revise/fix labstore replicate backup jobs - https://phabricator.wikimedia.org/T127567#2082241 (10chasemp) [23:31:01] 6Labs, 10Incident-20150331-LabsNFS-Overload, 3ToolLabs-Goals-Q4: Test labstore switchover - https://phabricator.wikimedia.org/T94607#2082268 (10chasemp) [23:31:03] 6Labs: Storage capacity & redundancy expansion (tracking) - https://phabricator.wikimedia.org/T85604#2082266 (10chasemp) 5Open>3Resolved I am resolving for now and it will be reviewed as part of T85604 [23:32:46] (03PS1) 10Matanya: add sulwatcher doc [labs/tools/stewardbots] - 10https://gerrit.wikimedia.org/r/274576 [23:32:47] 6Labs, 6Operations: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2082282 (10chasemp) [23:32:49] 6Labs, 
Patch-For-Review: Replicate data between codfw and eqiad - https://phabricator.wikimedia.org/T85606#2082278 (chasemp) Open>Resolved anything left here I believe could be considered part of T127567 [23:33:09] (CR) Matanya: [C: 2 V: 2] add sulwatcher doc [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/274576 (owner: Matanya) [23:35:49] Labs, Operations: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2082302 (chasemp) [23:35:50] Labs, Tool-Labs: user `marcmiquel` gzipping 41G file on NFS - https://phabricator.wikimedia.org/T124877#2082300 (chasemp) Open>Resolved thanks @valhallasw. for posterity we have spoken with marcmiquel a few times and he recognizes the issue. There is some mitigation in place to prevent this from m... [23:36:03] Labs, Tool-Labs, Tool-Labs-tools-Other: templatetiger runs `sort` on huge NFS files - https://phabricator.wikimedia.org/T124822#2082303 (chasemp) Is this still an issue? [23:36:17] (PS1) Matanya: fixing doc path for sulwatcher [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/274580 [23:36:59] (CR) Matanya: [C: 2 V: 2] fixing doc path for sulwatcher [labs/tools/stewardbots] - https://gerrit.wikimedia.org/r/274580 (owner: Matanya) [23:37:40] Labs, Operations: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2004220 (chasemp) [23:37:42] Labs, Operations: One instance hammering on NFS should not make it unavailable to everyone else - https://phabricator.wikimedia.org/T95766#2082313 (chasemp) Open>Resolved I believe with the rollout of https://gerrit.wikimedia.org/r/#/c/272900/ this has improved greatly. It is not necessarily difficu... 
[23:38:27] Labs, Operations: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2082320 (chasemp) [23:38:28] Labs, Labs-Sprint-105: NFS exports broken for new projects which want NFS - https://phabricator.wikimedia.org/T104881#2082318 (chasemp) Open>Resolved I believe this issue is old enough to not warrant attention? re: this works atm [23:39:55] Labs, Operations: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2004220 (chasemp) [23:39:57] Labs, Labs-Sprint-100, Tool-Labs: Clean up huge logs on toollabs - https://phabricator.wikimedia.org/T98652#2082340 (chasemp) Open>Resolved This has happened a few times and we will have to continue to clean things up. I'm hopeful T126623 will make it easier to identify and easier for users to... [23:46:23] Labs, Labs-Infrastructure, Operations: Unable to connect both redundant labstores to the shelves in parallel - https://phabricator.wikimedia.org/T117453#2082414 (chasemp) [23:46:25] Labs, Operations: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2082413 (chasemp) [23:50:30] Labs, Incident-Labs-NFS-20151216, Operations: Investigate better way of deferring activation of Labs LVM volumes (and corresponding snapshots) until after system boot - https://phabricator.wikimedia.org/T121629#2082449 (chasemp) [23:50:32] Labs, Operations: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2082448 (chasemp) [23:52:00] wasn't there a default url for users's public_html on tools? [23:54:07] Platonides: users don't have a public_html on tools (on purpose, so that tools reside under tool accounts) [23:54:57] I think there used to be [23:55:18] PROBLEM - ToolLabs Home Page on toollabs is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - string 'Magnus' not found on 'http://tools.wmflabs.org:80/' - 323 bytes in 0.004 second response time [23:55:24] Platonides: not on tools. 
[23:55:34] toolserver: yes, bots: maybe? [23:55:42] maybe :/ [23:56:26] * chasemp waves at valhallasw`cloud [23:56:34] chasemp: *waves* [23:57:14] awch, rools is down? [23:57:17] *tools [23:59:08] PROBLEM - Puppet failure on tools-webgrid-lighttpd-1208 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [23:59:16] PROBLEM - Puppet failure on tools-exec-1408 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [23:59:18] dns is flaking out atm