[02:12:59] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools-dev.diskspace._var.byte_avail.value (10.00%)
[06:31:51] Hello, could any admin reset my two-factor authentication on Labs and Phabricator? I bricked my phone and flashed a new firmware.
[06:32:08] s/Labs/Wikitech
[06:36:08] RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK
[11:51:48] !ping
[11:51:48] !pong
[13:24:17] YuviPanda: So, sorry you didn't get to be in the loop as much as you'd have liked but puppet is now able to construct (most of) a gridengine configuration yet remains relatively readable and straightforward (even though the underlying mechanism is teh suk)
[14:19:08] Wikimedia Labs / deployment-prep (beta): broken Upload file link in en-beta navigation - https://bugzilla.wikimedia.org/57117#c1 (Antoine "hashar" Musso (WMF)) p:Unprio>Lowest s:normal>enhanc On the production wiki that is a link to: https://en.wikipedia.org/w/index.php?title=Wikipedia:Fil...
[14:19:09] Wikimedia Labs / deployment-prep (beta): sync articles from production wikis (css/gadgets) - https://bugzilla.wikimedia.org/49779 (Antoine "hashar" Musso (WMF))
[14:20:24] Wikimedia Labs / deployment-prep (beta): Make deployment prep have continuous replication lag - https://bugzilla.wikimedia.org/57583#c2 (Antoine "hashar" Musso (WMF)) p:Unprio>Lowest Marking to lowest priority since nobody is sponsoring this for now.
[14:21:38] Wikimedia Labs / deployment-prep (beta): Lock wait timeout error on beta enwiki - https://bugzilla.wikimedia.org/59751#c2 (Antoine "hashar" Musso (WMF)) NEW>RESO/FIX Seems it was some transient issue back in January.
[14:22:52] Wikimedia Labs / deployment-prep (beta): Configure all deployment-prep instances to use local salt and puppet master by default - https://bugzilla.wikimedia.org/62795#c4 (Antoine "hashar" Musso (WMF)) p:Unprio>High We still need to have to set the configuration manually. Pointing to the local p...
[14:25:54] Wikimedia Labs / deployment-prep (beta): `sql` does not work on deployment-prep - https://bugzilla.wikimedia.org/63803#c6 (Antoine "hashar" Musso (WMF)) NEW>RESO/FIX That has been fixed somehow :-)
[14:27:09] Wikimedia Labs / deployment-prep (beta): false "wiki is read-only mode" message in beta labs - https://bugzilla.wikimedia.org/65228#c7 (Antoine "hashar" Musso (WMF)) NEW>RESO/FIX It seems the main issue was the database lag that would cause MediaWiki to switch readonly. The threshold has been ra...
[14:28:11] Wikimedia Labs / deployment-prep (beta): beta labs no longer listens for HTTPS - https://bugzilla.wikimedia.org/68387 (Antoine "hashar" Musso (WMF))
[14:28:11] Wikimedia Labs / deployment-prep (beta): Reenable $wgMWOAuthSecureTokenTransfer=true; - https://bugzilla.wikimedia.org/65421#c1 (Antoine "hashar" Musso (WMF)) p:Unprio>Low s:normal>enhanc Blocked by Bug 68387 - beta labs no longer listens for HTTPS
[14:30:43] Wikimedia Labs / deployment-prep (beta): Submitting Special:ChangePassword gives Internal error - https://bugzilla.wikimedia.org/63396#c4 (Antoine "hashar" Musso (WMF)) *** Bug 66401 has been marked as a duplicate of this bug. ***
[14:30:43] Wikimedia Labs / deployment-prep (beta): Beta cluster centralauth accounts points to no more existing wikis - https://bugzilla.wikimedia.org/66401#c3 (Antoine "hashar" Musso (WMF)) NEW>RESO/DUP That is a duplicate of Bug 63396 - Submitting Special:ChangePassword gives Internal error. The bug eve...
[14:31:08] Wikimedia Labs / deployment-prep (beta): Beta cluster centralauth accounts points to no more existing wikis - https://bugzilla.wikimedia.org/63396#c5 (Antoine "hashar" Musso (WMF)) Reusing topic of duplicate bug 66401
[14:33:54] Wikimedia Labs / deployment-prep (beta): Requested 115.108.187.192.proxies.dnsbl.sorbs.net., not found in proxies.dnsbl.sorbs.net.. - https://bugzilla.wikimedia.org/71894#c5 (Antoine "hashar" Musso (WMF)) NEW>RESO/WOR The message is logged from includes/User.php: if ( $ipList ) { wfDebugLog(...
[14:35:09] Wikimedia Labs / deployment-prep (beta): ferm policy on deployment-bastion prevents scap rsync from mw hosts - https://bugzilla.wikimedia.org/70858#c1 (Antoine "hashar" Musso (WMF)) NEW>RESO/FIX That was solved since scap works now :)
[14:36:26] for salt users in labs, I have cleaned up a lot of dead salt minion processes; all but three hosts are responsive, those last 2 or 3 respond but it can take up to two minutes
[14:36:58] I've also accepted all keys on labcontrol2001 ( Coren ) and deleted all keys of deleted instances on virt1000
[14:37:14] this is a prelude to updating all salt minions on active instances, likely wo't get to that today
[14:37:38] Wikimedia Labs / deployment-prep (beta): Yell loudly of failed puppet runs on Beta Cluster instances - https://bugzilla.wikimedia.org/67333#c7 (Antoine "hashar" Musso (WMF)) We now have notifications on irc channel #wikimedia-qa and a few people receives an hourly mail until all instances pass puppet....
[14:37:41] i-00000113.eqiad.wmflabs has ext4 filesystem errors in dmesg
[14:38:51] as does i-00000350.eqiad.wmflabs... and that's the news
[14:38:54] Wikimedia Labs / deployment-prep (beta): Yell loudly of failed puppet runs on Beta Cluster instances - https://bugzilla.wikimedia.org/67333#c8 (Antoine "hashar" Musso (WMF)) Blocks Bug 51497 - Setup monitoring for Beta cluster
[14:38:54] Wikimedia Labs / deployment-prep (beta): Setup monitoring for Beta cluster - https://bugzilla.wikimedia.org/51497 (Antoine "hashar" Musso (WMF))
[14:39:34] hashar: in case you care, see the above ^^
[14:39:44] it's general labs, not beta
[14:39:54] Wikimedia Labs / deployment-prep (beta): sync-site-resources should sync all Labs wikis - https://bugzilla.wikimedia.org/49791 (Antoine "hashar" Musso (WMF)) p:Normal>Lowest
[14:40:08] Wikimedia Labs / deployment-prep (beta): sync articles from production wikis (css/gadgets) - https://bugzilla.wikimedia.org/49779 (Antoine "hashar" Musso (WMF)) p:Normal>Lowest
[14:40:23] Wikimedia Labs / deployment-prep (beta): automatically import some content from production (tracking) - https://bugzilla.wikimedia.org/52382 (Antoine "hashar" Musso (WMF)) p:Normal>Lowest
[14:40:30] apergos: can you bug fill them please ? :D
[14:41:17] what would that go under? should I find out which actual hardware they are on and check that box?
[14:41:42] I assume you mean the ext4 failures, not the slow salt, if I don't debug it no one else is going to. ever.
[14:41:48] hashar:
[14:43:05] apergos: sorry I though that the ext4 issues were instances in the beta cluster :°
[14:45:39] Wikimedia Labs / deployment-prep (beta): easily reload all apaches - https://bugzilla.wikimedia.org/36422#c20 (Antoine "hashar" Musso (WMF)) NEW>RESO/FIX Seems it is fixed now :) Thank you Bryan.
[14:46:08] Wikimedia Labs / deployment-prep (beta): sync Sandbox gadget from production to en.wikipedia.beta.wmflabs.org - https://bugzilla.wikimedia.org/47205 (Antoine "hashar" Musso (WMF)) p:Normal>Lowest
[14:46:53] Wikimedia Labs / deployment-prep (beta): GettingStarted extension broken at beta.wmflabs.org - https://bugzilla.wikimedia.org/51362 (Antoine "hashar" Musso (WMF)) p:Unprio>Low
[14:48:53] Wikimedia Labs / deployment-prep (beta): monitor that application servers are responding - https://bugzilla.wikimedia.org/52867#c6 (Antoine "hashar" Musso (WMF)) s:critic>enhanc Resetting severity. If it was really critical it would have been fixed long ago. Yuvi Panda is working on integrating...
[14:50:55] Wikimedia Labs / deployment-prep (beta): wrong links on file pages on beta labs for instant commons images - https://bugzilla.wikimedia.org/57122#c1 (Antoine "hashar" Musso (WMF)) NEW>RESO/WON The analysis is correct, the image description comes from production commons and thus reuse whatever tem...
[14:52:23] Wikimedia Labs / deployment-prep (beta): Detect image loading from production, pull those images into beta cluster - https://bugzilla.wikimedia.org/61784#c3 (Antoine "hashar" Musso (WMF)) NEW>RESO/WOR We do have instant commons and the foreignApi backend file system of doom. So in theory all imag...
[14:54:08] Wikimedia Labs / deployment-prep (beta): Caching makes it impossible to test JS changes when logged out - https://bugzilla.wikimedia.org/63034#c5 (Antoine "hashar" Musso (WMF)) Sam, any clue how we invalidate the JS/CSS cache on the bits cache when doing production deployment? I am assuming it expire...
[14:58:09] Wikimedia Labs / deployment-prep (beta): Beta Cluster stopped receiving code updates (beta-update-databases-eqiad hung) - https://bugzilla.wikimedia.org/70597#c9 (Antoine "hashar" Musso (WMF)) That happened again on Oct 23 2014. Looking at Jenkins thread dumps, the deployment-bastion execute thread ar...
[14:58:39] Wikimedia Labs / deployment-prep (beta): Install and configure pool counter - https://bugzilla.wikimedia.org/70940 (Antoine "hashar" Musso (WMF)) p:Unprio>Low
[14:59:08] Wikimedia Labs / deployment-prep (beta): monitor unsigned salt keys - https://bugzilla.wikimedia.org/70862 (Antoine "hashar" Musso (WMF)) p:Unprio>Low s:normal>enhanc
[15:00:28] Wikimedia Labs / deployment-prep (beta): monitor unsigned salt keys - https://bugzilla.wikimedia.org/70862#c5 (Ariel T. Glenn) no autoacceptance in the works? That would take care of the problem.
[15:25:53] Wikimedia Labs / deployment-prep (beta): Beta cluster centralauth accounts points to no longer existing wikis - https://bugzilla.wikimedia.org/63396 (Alex Monk)
[15:38:41] on deployment-prep, if I run this query:
[15:38:50] show databases where `Database` not in ('arwiki','cawiki','commonswiki','deploymentwiki','dewiki','ee_prototypewiki','en_rtlwiki','enwiki','enwikibooks','enwikinews','enwikiquote','enwikisource','enwikiversity','enwiktionary','eowiki','eswiki','fawiki','hewiki','hiwiki','jawiki','kowiki','loginwiki','metawiki','ruwiki','simplewiki','sqwiki','testwiki','ukwiki','wikidatawiki','zhwiki');
[15:39:16] (that list of wiki DBs is from CentralAuthUser::getWikiList())
[15:39:45] I get some entries back which should be listed by CentralAuthUser::getWikiList()....
[15:41:27] At least aawiki, not sure about others
[15:43:47] these look like wiki dbs: auth, dewikinews, dewikivoyage, enwikivoyage, labswiki
[15:46:55] no idea what/where auth or labswiki are supposed to be. dewikinews/dewikivoyage/enwikivoyage return "No wiki found" at the normal hostnames.
[16:10:35] Coren: i-00000113.eqiad.wmflabs and i-00000350.eqiad.wmflabs have ext4 filesystem errors in dmesg; do you care about these? I came across on accidentally and then did a scan, there's only the two
[16:11:10] I'd need to know what they are first.
[16:11:23] Give me a minute, I'm hunting down a rogue tools process.
[16:13:31] ok. (they are: bastion2 and mediawiki2latex)
[16:14:14] bastion2 can probably just be rebuilt without a second thought; dunno about mediawiki2latex.
[16:15:16] Earwig|away: Ping!
[16:19:07] Lemme go see what those errors are.
[16:22:55] apergos: Both might be saved with just and fsck; but if not bastion2 can be rebuilt trivially. We'd have to ask the admins of collection-alt-renderer for the other though.
[16:23:25] Wikimedia Labs / Infrastructure: tools-webproxy returns 403 - https://bugzilla.wikimedia.org/72481 (Kai Nissen) NEW p:Unprio s:normal a:None when making requests to tools-webproxy within a php script, the server returns 403. it doesn't seem to be a mapping issue, since the requests to tools...
[16:24:37] Wikimedia Labs / Infrastructure: tools-webproxy returns 403 - https://bugzilla.wikimedia.org/72481#c1 (Marc A. Pelletier) This is almost certainly caused by the request not providing a valid User-Agent string.
[16:25:38] Wikimedia Labs / deployment-prep (beta): Beta cluster centralauth accounts points to no longer existing wikis - https://bugzilla.wikimedia.org/63396#c6 (Alex Monk) There's some slightly weird cases around the code I posted. Some quotes from IRC so they aren't missed. on deployment-prep, if I...
[16:26:09] Wikimedia Labs / deployment-prep (beta): Beta cluster centralauth accounts points to no longer existing wikis - https://bugzilla.wikimedia.org/63396#c7 (Alex Monk) (Additionally, centralauth.localnames/localusers seems to have a ton of references to wikis that AFAIK have never been in deployment-prep?)
[16:28:08] Coren: ok. do you want a bz report or an email or a ticket... or is just letting you know here enough?
[16:30:39] For bastion2, consider me notified. collection-alt-renderer admins might appreciate being notified though.
[16:30:52] Anomie seems like the best POC
[16:31:00] ok
[16:31:06] Jeff_Green also is an admin.
[16:31:31] ah ... and in this channel. anomie mediawiki2latex has ext4 fs errors in dmesg, it affects things like dpkg
[16:31:37] presumably also apt, don't know how much more
[16:33:22] Wikimedia Labs / Infrastructure: tools-webproxy returns 403 - https://bugzilla.wikimedia.org/72481#c2 (Kai Nissen) NEW>RESO/INV exactly. thank you, coren.
[16:34:38] Wikimedia Labs / deployment-prep (beta): Configure all deployment-prep instances to use local salt and puppet master by default - https://bugzilla.wikimedia.org/62795#c5 (Bryan Davis) Both salt and puppet Coren, apergos: cscott would probably be better to ask about that one.
[16:41:00] collection-alt-renderer admins was an mwalker thing, i think. let me check.
[16:42:11] cscott: The /var filesystem has some b0rken inodes. I don't know whether we can just rebuild the instance from scratch or if it needs manual cleaning up.
[16:42:32] i think all those instances can just go away
[16:42:39] Even better
[16:42:41] do we archive images, just in case I'm wrong about this?
[16:43:12] maybe send email to mwalker and see if he can provide more info. afaik, none of these machines are in use any more.
[16:43:48] although mwalker-mwlib was modified on 20 sep 2014? where does that modification date come from?
[16:43:54] (i'm looking at https://wikitech.wikimedia.org/wiki/Nova_Resource:Collection-alt-renderer )
[16:44:51] the actual beta/labs ocg machines are in the deployment-prep project
[16:45:07] (deployment-pdf01 and deployment-pdf02)
[16:46:09] coren, what's your wmf email? i still have great trouble mapping irc nicks to email addresses.
[16:46:39] nevermind, i think i figured it out
[16:46:49] i'm sending mwalker and email, cc'ing you
[16:56:43] cscott: Both mpelletier@ and marc@ work
[16:58:08] you got the email?
[16:58:43] Aye
[17:00:14] ok, we'll wait for mwalker to land his flying car & respond. ;)
[17:18:58] <^d> YuviPanda: I got stats in graphite yesterday :)
[17:19:18] <^d> I think I might try Automattic's fork though.
[17:19:27] <^d> It's 49 commits ahead of Swoop's master ;-)
[17:27:02] cscott: regarding the rendering machines; I believe they're still pointed to by betalabs?
[17:27:21] betalabs uses deployment-pdf01 in the deployment-prep project.
[17:27:39] unless there's some other beta project i haven't been deploying to ;)
[17:27:43] mwalker|alt: ^
[17:27:53] nope; that sounds correct :p
[17:28:24] in that case, I think the only thing that is still being used in collection-alt-renderer is the OCalm renderer that was written by a member of the community
[17:29:23] what machine is that on?
[17:29:40] and what does that render to? and who uses/maintains it?
[17:30:54] Wikimedia Labs / deployment-prep (beta): Caching makes it impossible to test JS changes when logged out - https://bugzilla.wikimedia.org/63034#c6 (Krinkle) So far this bug has failed to give any clear definition of what kind of resources this is about. Are we talking about static resources served dire...
[17:31:24] Wikimedia Labs / deployment-prep (beta): Caching makes it impossible to test JS changes when logged out - https://bugzilla.wikimedia.org/63034#c7 (Krinkle) So yeah, what resources are we talking about, in what way are they not being invalidated, and how is it different from production?
[17:33:16] cscott: it's Dirk Hünniger; and its the mediawiki2latex instance
[17:33:41] I don't think anything points to it; its like any of our other random tools that we have on labs
[17:38:52] cscott: I went ahead and just deleted the old instances -- not sure what to do about Dirks instance
[17:39:29] so it's the mediawiki2latex instance which has problems. backlog: "mediawiki2latex has ext4 fs errors in dmesg, it affects things like dpkg"
[17:39:40] mwalker|alt: do you have contact info for Dirk?
[17:41:23] cscott: if you search your email (or gmane) for "MediaWiki to Latex Converter" on wikitech-l, you should find an email from him on Jul 17
[18:00:22] andrewbogott and YuviPanda (thanks andrew for the review), note the salt kkeys cleaup script uses the nova python api, no wikitech or shell script deps, in case you want to crib from it for any reaon
[18:00:23] reason
[18:00:39] Wikimedia Labs / Infrastructure: Increase number of Jenkins slaves to spread load and prevent browser test failures on beta - https://bugzilla.wikimedia.org/70049#c7 (Krinkle) I created two extra slaves last month. Bringing us to a total of eight. 4 Ubuntu Precise, 4 Ubuntu Trusty.
[18:16:56] hello
[18:19:37] Someone mind if they give me nova access? wikitech user Lixxx235
[18:22:35] L235: can you be more specific about what you need?
[18:22:46] access to add tools
[18:23:06] I'm planning on moving some IRC tools over
[18:23:10] to tool labs
[18:23:17] ok, I can add you to tool labs.
[18:23:21] thanks.
[18:24:17] L235: ok, you should be able to log into tools-login now.
[18:24:46] Please don't run any long-term jobs on the login host, though, you'll need to use the grid engine to submit persistent things.
[18:25:09] ok, got it
[18:29:04] http://meta.wikimedia.beta.wmflabs.org/wiki/Special:SpecialPages
[18:29:08] Request: GET http://meta.wikimedia.beta.wmflabs.org/wiki/Special:SpecialPages, from 127.0.0.1 via deployment-cache-text02 deployment-cache-text02 ([127.0.0.1]:3128), Varnish XID 281962437
[18:29:08] Forwarded for: 76.103.130.60, 127.0.0.1
[18:29:08] Error: 503, Service Unavailable at Fri, 24 Oct 2014 18:28:47 GMT
[18:29:54] (now in -qa)
[18:37:24] hey andrewbogott: when I try to add a new tool I'm still getting "You cannot complete the action requested as your user account is not in the project tools". any ideas, or am I doing something wrong? (sorry for the ping)
[18:39:31] L235: try logging out and logging in again
[18:39:35] ok, thanks
[18:39:49] Coren: I can't seem to get mysqldump to work on Tool Labs. Any suggestions for getting that to work (or another way to dump a table schema for cloning)?
[18:40:12] kaldari: There's no reason mysqldump shouldn't work. What happens when you try?
[18:40:30] Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock
[18:41:40] You need to use your replica credentials; either by putting them on the command-line or in .my.cnf
[18:41:59] I thought I was
[18:42:17] yeah, just double checked...
[18:42:18] Also, you can use the replica.my.cnf file directly with --defaults-file=
[18:42:50] legoktm: still not working
[18:42:57] :/
[18:43:01] Well, you also need to specify a database. /var/run/mysqld/mysqld.sock is the local database (which doesn't exist) and is only used if you don't specify one. :-)
[18:43:20] yeah, doing that too....
[18:43:26] What is the exact command line you are using?
[18:43:29] mysqldump -u s51999 -p s51999__merge_candidates people_no_date > people_no_date.sql
[18:43:46] while acting as the wikidata-game project user
[18:44:01] Right, you're missing a -h in there
[18:44:07] oops
[18:44:26] ''There's'' your problem
[18:44:40] :-)
[18:47:45] Coren: Looks like it works now. Thanks :)
[18:47:52] No worries.
[18:57:57] I'm still getting a "Your account is not in the project tools" error
[18:58:04] when I try going to https://wikitech.wikimedia.org/w/index.php?title=Special:NovaServiceGroup&action=addservicegroup&projectname=tools
[19:16:45] anywhere in particular i should look for logs about exceptions on beta? i checked logstash-beta.wmflabs.org and a `tail -f /data/project/logs/*.log` but not seeing my exception anywhere
[19:16:56] i was expecting /data/project/logs/exception.log but not there for some reason
[19:17:33] its an api request, if that makes a difference
[19:23:12] Coren: it looks like labcontrol2001 is set up as second salt-master for the same instances that virt1000 serves... it had a bunch of unaccepted keys sitting over there so I bulk accepted them for now
[19:23:49] I'm not sure what the exact plans for that salt master were; I think andrewbogott_afk didn't mean to keep it, then did.
[19:24:01] At any rate, I don't expect it can be harmful to have it working.
[19:24:26] ok, he can chime in later bout it... I'm about to afk myself but if we're going to do salt key cleanup out of cron I probably want to do it over there too
[19:24:42] hm
[19:25:22] have you been involved in any netapp stuff?
[19:25:29] any of you guys :)
[19:25:42] mutante: Nope.
[19:26:34] or sync the directory from virt1000 better, since nova client and creds aren't set up on labcontrol2001
[19:26:55] not in a long time.
[19:28:46] iirc there's a limited command set you have access to directly maybe by ssh (?), the rest is done by mounting whichever filesystem you want to work on somewhere else
[19:29:43] L235: I'm back now, give me a minute to catch up...
[19:32:12] L235: user 'Lixxx235' definitely /is/ in the tools project. I don't know why it's rebuffing you...
[19:32:33] apergos, in theory labcontrol2001 will be the salt master for the yet-unborn codfw labs instances
[19:32:45] If it is volunteering to be master for eqiad instances then that's...
[19:32:56] well, a bug or a feature, take your pick, but it's not on purpose :)
[19:33:11] so it turns out, my 500'd web request error message ended up in deployment-mediawiki01:/var/log/hhvm/error.log but not in logstash-beta.wmflabs.org or /data/project/logs/*.log . Seems likely a bug, but not sure what to report against?
[19:33:37] we certainly wont be logging into app servers to check hhvm error logs in prod
[19:37:54] andrewbogott: yeah, I can see that I'm in the project, but it's saying I'm not...
[19:38:10] let me reboot my machine
[19:39:32] L235: there's /very/ little chance that rebooting will make a difference :)
[19:39:39] Can you tell me what you're doing exactly?
[19:39:41] yep, I agree
[19:40:09] ok, one sec
[19:42:31] 1. sign in 2. go to http://tools.wmflabs.org/ 3. click "Create New Tool" 4. never mind
[19:42:37] never mind
[19:42:43] the reboot actually did let me in
[19:42:58] hm
[19:43:04] well, that's baffling, but, ok!
[19:43:13] I guess... some cookies or something?
[19:43:22] or cache
[19:44:23] Wikimedia Labs / deployment-prep (beta): no log in deployment-bastion:/data/project/logs from "503 server unavailable" on beta labs - https://bugzilla.wikimedia.org/72275#c2 (spage) p:Normal>High (In reply to Antoine "hashar" Musso (WMF) from comment #1) > udp2log / python demux script crash fr...
[19:51:07] Wikimedia Labs / deployment-prep (beta): SpecialCite's i18n is still being loaded which is breaking CiteThisPage in BetaLabs - https://bugzilla.wikimedia.org/71112 (Greg Grossmeier) PATC>NEW p:Unprio>Normal
[19:51:08] Wikimedia Labs / deployment-prep (beta): SpecialCite's i18n is still being loaded which is breaking CiteThisPage in BetaLabs - https://bugzilla.wikimedia.org/71112 (Greg Grossmeier) NEW>PATC
[19:51:54] Wikimedia Labs / deployment-prep (beta): SpecialCite's i18n is still being loaded which is breaking CiteThisPage in BetaLabs - https://bugzilla.wikimedia.org/71112 (Greg Grossmeier) a:James Forrester
[19:52:08] Wikimedia Labs / deployment-prep (beta): SpecialCite's i18n is still being loaded which is breaking CiteThisPage in BetaLabs - https://bugzilla.wikimedia.org/71112#c4 (Greg Grossmeier) Planned for Tuesday Oct 28th (this coming Tuesday)
[19:52:52] Wikimedia Labs / deployment-prep (beta): no log in deployment-bastion:/data/project/logs from "503 server unavailable" on beta labs - https://bugzilla.wikimedia.org/72275#c3 (Antoine "hashar" Musso (WMF)) I haven't took time to properly investigate the issue. But recently we had a new udp2log-mw insta...
[19:55:21] Coren: two things!
[19:55:45] Which are?
[19:56:01] 1) I've been messing with 'block-migrate' which works pretty well. It's an almost-live migration that causes a brief interruption in the instance but doesn't require a reboot.
[19:56:48] Ah, sorta suspends it in virtualization limbo?
[19:56:49] 1a) But there's a stupid bug in the block-migrate code that makes it impossible to migrate to most hosts. Specifically, block-migrate doesn't know to look at the overprovision ratio, so it won't migrate to any host that's over 100% usage (which is all of our nodes.)
[19:57:15] The suspension is quite brief, I'm not sure what it's actually doing behind the scenes.
[19:57:27] Impressive.
[19:58:02] 2) I re-imaged virt1006 and it's up and running compute, but in theory the scheduler won't assign any new instances to it for now.
[19:58:02] 2a) Being empty, it's a good candidate to receive block migrations.
[19:58:28] So, I'm in the market for instances which we can afford to lose (if virt1006 continues its evil ways) but which we'll notice if they break (thus noticing that virt1006 is continuing its evil ways)
[19:58:59] I've already moved the isntance that hosts my IRC bouncer -- that definitely fits the bill.
[19:59:07] Maybe we could put an exec node or two over there as well?
[19:59:44] Move -exec-12; it's the only Trusty node atm and it has little traffic.
[19:59:51] ok!
[20:00:01] So if it breaks, it's not a horrible catastropje.
[20:00:25] cool, let's see if this works...
[20:00:38] Two other good candidates are -shadow and -mail; both of those are noticable if they go away but easy to reboot or rebuild.
[20:01:30] (In fact, having -shadow on a virt host where -master definitely isn't is a good idea)
[20:01:40] OK, I'm moving exec-12 for starters. There may be a brief network interruption.
[20:02:54] If you 'nova show 47608ad4-1adc-4104-b1c5-96281a945ff8' you'll see its status as 'migrating'. Unfortunately wikitech doesn't hook changes in migration status so wikitech won't know about the new host until the instance reboots.
[20:07:26] Doesn't seem to be too bad a deal; I look for where an instance is with nova list anyways not wikitech's... fine interface. :-)
[20:11:53] Wikimedia Labs / deployment-prep (beta): no log in deployment-bastion:/data/project/logs from "503 server unavailable" on beta labs - https://bugzilla.wikimedia.org/72275#c4 (spage) (In reply to spage from comment #2) > We also have an instance udplog.eqiad.wmflabs There's also a deployment-fluoride...
[20:11:54] PROBLEM - ToolLabs: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: tools.tools.diskspace._var.byte_avail.value (10.00%)
[20:15:55] RECOVERY - ToolLabs: Low disk space on /var on labmon1001 is OK: OK: All targets OK
[20:20:06] Coren: well, that was a longer network interruption than I predicted… but is -12 looking OK to you? And was sge upset about it being briefly unreachable?
[20:20:35] That's what I'm looking at now.
[20:21:33] -12 doesn't seem to have noticed its temporary disparition.
[20:22:38] And the delay was well within gridengine's grace period.
[20:22:59] great! I'll move -shadow now.
[20:24:21] Except for being almost totally useless for real-world applications, this is a pretty cool feature
[20:24:55] You mean because of the overcommit?
[20:25:00] yeah
[20:25:29] That's worth a bug upstream I think, being able to, say, --force some-host would be a lifesaver.
[20:25:40] It basically means that we need to keep two empty hosts at all time, in case of an evacuation.
[20:25:57] Or one empty host with 2x capacity (which we just ordered three of)
[20:26:17] There's an open bug about it already, which I added my complaint to.
[20:26:18] Yeay procurement.
[20:31:12] !log tools moved tools-exec-12, tools-shadow and tools-mail to virt1006
[20:31:17] Logged the message, dummy
[22:51:00] (PS1) Dzahn: add fake passwords for iegreview [labs/private] - https://gerrit.wikimedia.org/r/168715
[22:52:48] (CR) Dzahn: [C: 2] "need this or can't run puppet-compiler on any change on node zirconium using this.. labs cant find private class" [labs/private] - https://gerrit.wikimedia.org/r/168715 (owner: Dzahn)
[22:53:04] (CR) Dzahn: [V: 2] "need this or can't run puppet-compiler on any change on node zirconium using this.. labs cant find private class" [labs/private] - https://gerrit.wikimedia.org/r/168715 (owner: Dzahn)