[01:46:59] Wikimedia / Continuous integration: Set up automatic builds for extensions ported to HHVM - https://bugzilla.wikimedia.org/63120#c19 (Daniel Zahn) how come this is resolved but there is still this unmerged? https://gerrit.wikimedia.org/r/#/c/150813/4 is it unneeded or maybe the bug not really resolve...
[02:05:15] Wikimedia Labs / deployment-prep (beta): [OPS] debianize PHP5 extension 'parsekit' - https://bugzilla.wikimedia.org/37076#c15 (Daniel Zahn) on integration slaves in labs it is attempted to install this package but it fails: Error: /Stage[main]/Contint::Packages/Package[php5-parsekit]/ensure: change fr...
[03:04:29] !log Ori made puppet changes that moved the MediaWiki install dir to /srv/mediawiki (https://gerrit.wikimedia.org/r/#/c/159431/). I didn't see that in SAL so I'm adding it here.
[03:04:32] Logged the message, Master
[03:05:43] where's the icinga y'all are using?
[03:06:42] umm… somewhere? YuviPanda|bzzzz set that stuff up really recently.
[03:07:26] well can you give a URL? :)
[03:07:51] * bd808 looks for an email about it
[03:07:53] and if not then that probably means it should be listed on some wiki page that it's not listed on
[03:08:56] It looks like the checks are in the "real" icinga server -- https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=labmon1001&service=Monitor+for+low+disk+space+on+%2Fvar+for+beta+labs
[03:09:29] that seems weird
[03:10:19] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=labmon1001
[03:10:27] how's about https://icinga.wmflabs.org/cgi-bin/icinga/status.cgi?hostgroup=deployment-prep&style=detail&nostatusheader
[03:13:28] That looks like an icinga server all right. I don't know what the difference is. Yuvi and/or mutante would probably know more.
[03:28:13] well, one of them is neon (.wikimedia.org). one of them is not
[03:28:27] maybe .wmflabs.org is labmon1001. not sure
[03:28:57] but QA shouldn't care too much about the checks for the monitoring host. as long as they're green
[03:29:05] the QA hosts are the focus :)
[08:06:29] Wikimedia Labs / deployment-prep (beta): [OPS] debianize PHP5 extension 'parsekit' - https://bugzilla.wikimedia.org/37076#c16 (Krinkle) See bug 68256 comment 5. The php jobs currently don't run on the trusty slaves so it missing will not cause immediate problems, and puppet fortunately continues applyi...
[08:41:51] zeljkof: hi! Finally we're around at the same time :)
[08:42:02] spagewmf: hi :)
[08:42:27] isn't it 1:40am for you? :)
[08:43:01] zeljkof: all the Flow and Echo builds 5 hours ago failed :( . The ones I've looked at all failed in a few seconds with a failure "File "/srv/deployment/integration/slave-scripts/bin/mw-api-siteinfo.py", line 78, in main"
[08:43:51] yup, late night. Anyway is ^ a known issue? I can file a bug before I go to sleep. E.g. https://integration.wikimedia.org/ci/job/browsertests-Echo-en.wikipedia.beta.wmflabs.org-linux-chrome-sauce/lastBuild/console
[08:44:23] spagewmf: let me take a look
[08:45:20] spagewmf: looks like the API changed, and the script that checks the git branch fails because of that
[08:45:27] please report the bug
[08:48:10] zeljkof: maybe it's been fixed? The cirrus browser test, run 3 minutes later, passed.
[08:48:22] spagewmf: hm
[08:48:29] does it help if you rerun the tests?
[08:49:25] zeljkof: I just kicked off an Echo browser test. But I'll file the bug anyway
[08:49:41] spagewmf: please do
[08:52:38] (PS2) Zfilipin: WIP sorting browser tests jobs alphabetically [integration/jenkins-job-builder-config] - https://gerrit.wikimedia.org/r/159343
[08:52:54] new browsertest build died quickly in the same place, so not fixed
[08:53:07] spagewmf: something is going on on beta then
[08:53:21] my guess is that the api changed, and the script has to be updated
[08:58:10] who knows beta?
[08:58:36] i was trying to follow instructions from bug 70597 but i'm not sure i have the right perms
[08:58:52] also, something weird on the apaches. the docroots don't exist
[08:59:46] $ fgrep DocumentRoot /etc/apache2/sites-enabled/02-main.conf
[08:59:46] DocumentRoot "/srv/mediawiki/docroot/wikipedia.org"
[08:59:47] DocumentRoot "/srv/mediawiki/docroot/wikibooks.org"
[08:59:47] $ stat /srv/mediawiki/docroot
[08:59:47] stat: cannot stat ‘/srv/mediawiki/docroot’: No such file or directory
[09:02:53] zeljkof: yes, the actual failure is beta labs be dead
[09:03:31] Wikimedia / Continuous integration: browser tests all failing early in mw-api-siteinfo.py call , beta-labs api.php returns 404 - https://bugzilla.wikimedia.org/70648 (spage) NEW p:Unprio s:major a:None All the Echo and Flow tests failed starting 5 hours ago. The ones I've looked at failed...
[09:04:05] api.php, load.php, index.php, all give 404s
[09:04:53] spagewmf: i already diagnosed that...
[09:05:50] jeremyb: thanks (/me looks in IRC) umm so what's up?
[09:06:15] well see the paste in here above
[09:06:19] from my shell
[09:06:37] i don't have a clue how things normally work but that can't be right
[09:07:31] also, [08:58:36] i was trying to follow instructions from bug 70597 but i'm not sure i have the right perms
[09:16:04] spagewmf
[09:19:03] jeremyb: yes? (I'm looking at deployment-mediawiki02, I dunno how it's normally set up but yeah /srv/mediawiki (and /a/common, and /usr/local/apache/common) look wrong)
[09:20:49] I would guess some sync process went wrong. Anyway, 2:20am here. Thanks jeremyb for looking at it.
[09:21:29] so, apparently prod was converted from /usr/local to /srv/mediawiki
[09:21:36] not sure if beta got on that train
[09:24:10] jeremyb: https://gerrit.wikimedia.org/r/#/c/159431/ "beta: switch to /srv/mediawiki" merged today.
[09:24:22] got on the train, rode off the tracks :)
[09:27:13] well how do i normally deploy?
[09:27:18] if i want to test a change?
[09:27:26] or i have to let jenkins deploy for me?
[09:28:43] i wonder if maybe puppet was stopped some places but not others
[09:34:01] jeremyb: I don't understand your questions. deploying a patch doesn't depend on beta labs working. Meanwhile some process updates beta labs with git master every few minutes, and every 12 hours jenkins runs browser tests against beta labs.
[09:34:07] i think i figured it out
[09:35:59] hmm, ori's comment in that gerrit is "I made /srv/mediawiki be a symlink to
[09:36:06] right
[09:36:10] /srv/common-local". The latter has all the expected stuff
[09:36:11] but puppet failures
[09:36:31] and /srv/mediawiki isn't a symlink to it.
[09:38:05] Sep 10 05:51:58 deployment-salt kernel: [8255978.330110] Killed process 8394 (puppet) total-vm:1647904kB, anon-rss:1487172kB, file-rss:2416kB
[09:40:43] jeremyb: so if puppet succeeds the symlink will be set up and beta labs will work? I hope so, good night!
[09:41:41] there also was a full disk somewhere at some point
[09:42:00] Wikimedia / Continuous integration: browser tests all failing early in mw-api-siteinfo.py call , beta-labs api.php returns 404 - https://bugzilla.wikimedia.org/70648#c1 (spage) Everything on http://en.wikipedia.beta.wmflabs.org/ is a 404, index.php, load.php as well. jeremyb noticed /srv/mediawiki on...
[09:42:15] never a good sign :) Again thanks for working on this
[09:42:20] zzz
[09:42:30] spagewmf: what about jenkins perms?
[09:42:42] do those need granting or everyone has the same rights?
[09:42:44] anyway, nachgt
[09:42:46] nacht*
[10:13:46] zeljkof_: not entirely back yet
[10:14:11] jeremyb: are you working on it?
[10:14:18] yeah
[10:14:22] jeremyb: great :)
[10:14:40] jeremyb: please ping the qa list and update the bug when you are done
[10:14:45] or just ping me and I will do it
[10:14:53] !log deployment-bastion killed puppet lock
[10:14:55] Logged the message, Master
[10:15:09] !log deployment-salt started puppetmaster && puppet run
[10:15:11] Logged the message, Master
[10:15:27] !log deployment-mediawiki0[12] both had good puppet runs
[10:15:29] Logged the message, Master
[10:16:34] !log deployment-salt had an oom-kill recently. and some box (maybe master, maybe client?) had a disk fill up
[10:16:36] Logged the message, Master
[10:16:51] making use of my new bot huh
[10:16:52] :P
[10:17:07] !log deployment-bastion good puppet run
[10:17:09] Logged the message, Master
[10:17:23] jeremyb: yes, I do not remember seeing qa-morebots around before :)
[10:17:37] zeljkof: i added him to the channel
[10:20:02] !log deployment-bastion /var at 97%, freed up ~500MB. apt-get clean && rm -rv /var/log/account/pacct*
[10:20:04] Logged the message, Master
[10:21:47] zeljkof: (you know where he logs to, right?)
[10:21:54] jeremyb: no
[10:21:56] qa-morebots
[10:21:56] I am a logbot running on tools-exec-08.
[10:21:56] Messages are logged to https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL.
[10:21:56] To log a message, type !log .
[10:38:56] anyone happen to know why commons beta is down? http://commons.wikimedia.beta.wmflabs.org
[10:40:42] dan-nl_: jeremyb is working on it, I think
[10:41:09] https://bugzilla.wikimedia.org/show_bug.cgi?id=70648
[10:43:42] not sure why it went 404 again...
[12:05:35] (PS3) Zfilipin: Sort notification-emails in browsertests.yaml alphabetically [integration/jenkins-job-builder-config] - https://gerrit.wikimedia.org/r/159334
[12:09:58] (CR) Zfilipin: "recheck" [integration/jenkins-job-builder-config] - https://gerrit.wikimedia.org/r/159343 (owner: Zfilipin)
[12:10:13] (PS3) Zfilipin: WIP sorting browser tests jobs alphabetically [integration/jenkins-job-builder-config] - https://gerrit.wikimedia.org/r/159343
[12:18:27] zeljkof: ping
[12:18:45] jeremyb: pong :)
[12:19:00] saw above [08:58:36] i was trying to follow instructions from bug 70597 but i'm not sure i have the right perms
[12:19:01] ?
[12:19:11] still not sure if i have the right bits for that
[12:19:38] jeremyb: uh
[12:19:50] I have no clue what is going on :)
[12:19:59] sorry, have to go, will be back in 30 minutes or so
[12:20:02] how can i disable slaves?
[12:20:27] jeremyb: I do not know, bd808|BUFFER should know, or hashar
[12:20:36] but both of them are not online
[12:20:41] * zeljkof is out to lunch
[12:20:47] k
[13:25:29] (Abandoned) Zfilipin: WIP sorting browser tests jobs alphabetically [integration/jenkins-job-builder-config] - https://gerrit.wikimedia.org/r/159343 (owner: Zfilipin)
[13:40:33] brb
[13:41:13] does YuviPanda maybe know about jenkins?
[13:41:20] (asked above)
[13:41:20] nope :(
[13:41:25] :(
[13:41:27] hashar or krinkle
[13:41:40] or bd808|BUFFER!!!
[13:41:55] yeah
[14:01:35] (PS1) Zfilipin: Sort browsertests jobs [integration/jenkins-job-builder-config] - https://gerrit.wikimedia.org/r/159474
[14:03:49] weird issue on both test.wp and the beta cluster. when i try to import (either interwiki or xml), i get " Import failed: Expected tag, got ", but the same import works fine on my own server
[14:06:43] (PS1) Zfilipin: WIP Trying to use variable for VisualEditor job [integration/jenkins-job-builder-config] - https://gerrit.wikimedia.org/r/159475
[14:29:44] jackmcbarn: beta's still not stable
[14:30:03] jeremyb: testwiki has the exact same problem, and it's on the production cluster
[14:30:21] ok, then fine :)
[14:44:36] * bd808 sees a lot of pings in backscroll
[14:45:45] jeremyb: Still some ongoing issue with beta? http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page LGTM
[14:50:22] hi zeljkof do you know this error "ValueError: Extra data: line 1 column 4 - line 1 column 18 (char 4 - 18)" https://integration.wikimedia.org/ci/view/BrowserTests/view/-All/job/browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/261/console
[14:50:34] zeljkof: I am re-running that build now to see if it repeats
[14:51:01] did not repeat
[14:51:23] jeremyb: What is this local commit on deployment-salt with title "checkpoint"? Looks like it fixes an error in a manifest and also kills some of _joe_'s work with hhvm?
[14:54:10] beta labs API coughed up a db error just now but did not repeat that either.
[14:58:06] chrismcmahon: https://bugzilla.wikimedia.org/show_bug.cgi?id=70648
[14:58:11] probably that bug
[14:58:20] jeremyb was working on fixing the problem
[14:58:26] not sure what the status is
[14:58:43] beta was down, looks like it is back up
[14:58:49] that's the one.
[15:13:35] bd808: hey
[15:13:48] wikitechwiki's totally broken for me
[15:14:03] bd808: i'm pretty sure i didn't lose anyone's work... look closely?
[15:14:20] even logged out wikitech is broken
[15:16:12] i'm now in the middle of fighting with trebuchet on jobrunner01
[15:16:42] now would be a pretty good time to be able to see wikitech!
[15:18:33] oh, i see it was unplanned!
[15:18:38] (reading #-operations)
[15:20:28] jeremyb: wikitech works for me
[15:20:36] maybe now it does
[15:20:38] not before
[15:20:40] maybe this would help
[15:20:40] https://web.archive.org/web/*/http://wikitech.wikimedia.org
[15:20:45] if it is down
[15:21:01] no. i need to check a puppet variable
[15:24:18] bd808: so, do you think i lost anything or not?
[15:25:33] jeremyb: Well stuff is commented out in that commit and it's not in gerrit
[15:25:49] like what?
[15:26:00] apache::conf { 'hhvm_catchall':
[15:26:06] right
[15:26:16] that was a duplicate resource so it wouldn't compile at all
[15:26:31] so it's still taking effect, just from a different manifest
[15:26:52] Ah. Ok, but we should fix "for real" -- http://paste.debian.net/120235/
[15:27:09] So we should take that commit, push to gerrit and get it merged
[15:27:26] first i was trying to get stuff working at all
[15:27:30] sure
[15:27:31] there's other stuff to get merged
[15:28:23] http://paste.debian.net/120236/ -- all but the 2 [LOCAL HACK] commits are in gerrit waiting for an opsen to approve
[15:28:26] https://git.wikimedia.org/commitdiff/operations%2Fpuppet.git/79d0ec61a89542de01b6f3689adeb11aa6a43066
[15:28:45] really?
[15:29:02] i think not :)
[15:29:17] Wikimedia Labs / deployment-prep (beta): Import failed Expected tag, got - https://bugzilla.wikimedia.org/70658 (dan) NEW p:Unprio s:normal a:None commons beta is having a problem importing some templates from commons; e.g. Template:Artwork, Template:Map. steps to reproduce --...
[15:29:45] jeremyb: You think not?
[15:30:04] the one i just linked is not waiting for review?
[15:30:07] AFAIK
[15:30:36] So you didn't push it either?
[15:31:23] i pushed to sandbox
[15:31:46] The workflow hashar and I have been using is at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code_is_updated#Cherry-picking_a_patch_from_gerrit
[15:31:46] i even pushed a wrong version first, tested and then pushed a working version
[15:32:17] that's not much different than what i did
[15:32:25] it's just not a published changeset
[15:32:35] nor a draft
[15:33:02] Yeah which means it is diverging from the prod branch
[15:33:13] well it would be anyway
[15:33:29] Rather than just waiting forever for a root to review and merge
[15:33:37] * jeremyb goes back to hacking...
[15:35:45] It would only be diverging until it is merged. The git repo on deployment-salt is rebased on the upstream production branch head once an hour. Things that get merged there become "official" and most importantly can be seen by anyone working on puppet refactoring and new patches.
[15:36:25] I totally appreciate you helping get things running again
[15:37:04] Now I want to get the fixes upstreamed so it doesn't happen again as soon as the next apache change patch is merged
[15:39:22] (CR) Bartosz Dziewoński: "Needs rebase." [integration/zuul-config] - https://gerrit.wikimedia.org/r/159414 (owner: Legoktm)
[15:47:14] Wikimedia Labs / deployment-prep (beta): Import failed Expected tag, got - https://bugzilla.wikimedia.org/70658#c1 (Jackmcbarn) This also happens when doing interwiki imports from en.wikipedia.org to test.wikipedia.org. Note that the export files themselves are fine, as I was able to import...
[15:52:44] Wikimedia Labs / deployment-prep (beta): Beta Cluster api.php, index.php, load.php return 404 (caused failed browser tests) - https://bugzilla.wikimedia.org/70648 (Greg Grossmeier) p:Unprio>Highes
[15:53:58] hah, wikitech is still broken
[15:54:10] was faster for me to figure out how to get what i needed with `ldapsearch`
[15:56:42] yeah, so i can't fix jobrunner until wikitech is back
[15:56:48] it's misconfigured in ldap
[15:56:56] puppetVar: deployment_server_override=deployment-scap.eqiad.wmflabs
[15:57:02] and that host doesn't even exist
[15:57:25] i tried manually setting the grain and puppet "fixed" it back to the broken value
[15:57:27] :D
[16:12:43] jeremyb: so status of the 404s on beta cluster seem to be better? https://bugzilla.wikimedia.org/show_bug.cgi?id=70648
[16:13:06] greg-g: should be... still working on stuff that's breaking jenkins
[16:13:33] kk, I'll wait a bit until the final fix (re what bd808 said above) is merged
[16:14:31] but jenkins should have been broken
[16:14:40] some of this stuff didn't just break because of a merge
[16:14:55] e.g. wikitech puppetvar pointing at an nxdomain
[16:16:46] greg-g: VE build back to green, so beta+Jenkins is at least functioning: https://integration.wikimedia.org/ci/view/BrowserTests/view/-All/job/browsertests-VisualEditor-en.wikipedia.beta.wmflabs.org-linux-firefox-sauce/
[16:17:21] well not all of beta has code updates
[16:17:29] so maybe a bad test :)
[16:18:24] so, what is deployment-mediawiki03?
[16:18:27] it's not in varnish
[16:18:33] "11:37 < bd808> Now I want to get the fixes upstreamed so it doesn't happen again as soon as the next apache change patch is merged" is the part I'm worried about/waiting on ;)
[16:18:51] greg-g: well one is waiting for *your* review...
[16:22:03] also, what's saio?
[16:23:51] jeremyb: oh, who told me?
[16:23:55] jeremyb: beta fixed \o/ Did you work all night?!
[16:24:23] jeremyb: which review?
[16:24:43] spagewmf: essentially
[16:24:52] greg-g: well maybe you ignore those emails? :)
[16:24:56] mostly yes
[16:25:22] https://gerrit.wikimedia.org/r/159487
[16:25:23] the amount of work I do in gerrit is limited to approving things, what do you need, can you give me a link?
[16:25:26] ty
[16:25:38] i think it's fairly simple!
[16:26:11] if I knew the code base/knock on effects, sure, which I don't
[16:26:15] :)
[16:26:51] well it's in the beta module
[16:27:04] so it's very unlikely that prod would rely on it. if so then that's a bug
[16:27:36] the other people on the list are good reviewers
[16:28:27] sorry, I'm only partially aware of the symlink changes going on, you could have the opposite change there and tell me the same thing and I'd only have to believe you
[16:28:48] hahaha
[16:28:59] well that's kinda orthogonal
[16:29:19] not sure how
[16:29:27] could affect the order (new vs. old). but either way beta should mean not prod
[16:29:39] I have no idea what that sentence means
[16:29:51] (in this context)
[16:30:32] the file i changed is called modules/beta/files/LocalSettings.php
[16:30:57] if anything under b/modules/beta can affect prod then that's a bug
[16:31:00] IMHO
[16:31:06] err
[16:31:11] s|b/||
[16:31:42] and i have more jenkins errors to look at \o/
[16:33:30] jeremyb: I'm not worried about affecting prod
[16:33:48] I'm not the/a judge of what is a smart puppet change for Beta Cluster
[16:34:11] does that make sense now?
[16:34:20] aha!
[16:34:21] ok
[16:34:23] :)
[16:35:09] hrmmmm. so who should i ask about jobrunners
[16:35:38] who ever you would ask about jobrunners for prod :)
[16:36:36] ok :)
[17:02:25] jeremyb: to make it clear though, thanks a ton for taking on the issue last night
[17:03:26] greg-g: so, have some more questions. mostly because i don't know enough about beta
[17:03:34] e.g. should jobrunner be hhvm?
[17:03:38] web is, right?
[17:03:52] jobrunner is currently zend I think
[17:06:25] also, working more on dealing with the jobrunners not getting mediawiki issue
[17:06:32] first attempt at monkeypatch didn't work
[17:06:46] and brb
[17:07:07] jeremyb: I wouldn't switch anything else to hhvm without ori knowing/doing it/watching it
[17:07:20] right...
[17:08:09] greg-g: can you answer the earlier question on mediawiki03?
[17:08:12] can i nuke that?
[17:08:17] or make a mediawiki04?
[17:08:45] (varnish only has 01 and 02 atm)
[17:11:06] whoa, 8192MB ram. these beta boxes are beefy!
[17:11:58] * jeremyb is making a mediawiki04
[17:13:23] what an informative error msg "Failed to create instance. "
[17:14:17] jeremyb: mw03 is for the security researcher's test, he sets a cookie and his requests are routed to that host
[17:14:27] huh
[17:14:30] are you sure? :)
[17:14:35] it's not in varnish...
[17:14:42] it worked last week
[17:14:52] marxarelli would answer, if he were online
[17:14:59] he set it up (dduvall)
[17:15:25] jeremyb: https://bugzilla.wikimedia.org/show_bug.cgi?id=70181
[17:15:41] https://bugzilla.wikimedia.org/show_bug.cgi?id=70181#c5
[17:16:36] how weird
[17:18:13] greg-g: well it's definitely not there on the box
[17:19:11] there's a bunch of recent successful puppet runs
[17:25:05] yeah, not working for me, I set the cookie and am getting mw02
[17:26:49] right
[17:27:00] well very easy for me to do if you want
[17:27:15] does he have plans to test in the very near future?
[17:28:11] https://gerrit.wikimedia.org/r/#/c/158016/
[17:28:31] right, i saw
[17:28:33] it was working for a bit, I had graphs showing mw03 cpu going way up as he was hitting it
[17:28:52] probably a local hack and now gone due to puppet runs? :/
[17:29:30] ohhhh
[17:29:33] it's not merged!
[17:29:46] what am i thinking. i was searching for it in my local (laptop) clone
[17:29:49] :P
[17:30:02] (but it's also not on the server)
[17:30:06] let's see about puppet
[17:37:15] oh, double silly me. i didn't log in to normal varnish at all yet. just bits varnish
[17:41:55] heh
[17:49:29] greg-g: i think cookie does work. how did you disprove it?
[17:49:36] i was just looking on the wrong box
[17:49:46] (i tested)
[17:49:51] it works
[17:53:16] hhmmmm, lemme re-try
[17:53:57] ok, so i assume saio is swift all in one
[17:54:12] which i guess is not a good way to replicate prod!
[17:54:19] what box can i kill?
[17:55:55] why do you need to kill a box?
[17:56:45] well, odd things from my view re cookie: I set the "security_audit=1" cookie and I get errors on Special:RecentChanges and :Random, setting it to =0 makes them work
[17:57:00] jeremyb: before you kill things, can you wait until marxarelli is online, please?
[17:57:57] mediawiki03 isn't in the scap pool yet I just noticed.
[17:58:04] so it has stale code
[17:58:18] eek
[17:58:27] * bd808 thinks this crap is too complicated
[17:58:44] I'll ping Dan later and tell him how to fix it
[17:58:44] Wikimedia Labs / deployment-prep (beta): Setup a mediawiki03 (or what not) on Beta Cluster that we can direct the security scanning work to - https://bugzilla.wikimedia.org/70181#c9 (Greg Grossmeier) 13:57 < bd808> mediawiki03 isn't in the scap pool yet I just noticed. 13:58 < bd808> so it has...
[17:58:49] greg-g: i want to kill a box because i want to make a new box. i want to test puppet runs from clean state
[17:59:01] just make a new box then?
[17:59:15] greg-g: we're at quota. and to complicate things labs is having some nfs issues
[17:59:24] ask for more quota
[17:59:30] seriously
[17:59:56] i did already. see what the response was? good timing :)
[18:00:21] I think we went on the box killing hunt when saio was added and decided that we needed all of them
[18:01:01] maybe we should consider docker for cluster #2
[18:01:08] for another time :)
[18:07:04] jeremyb: so, I'm still lost why you need to create a new instance, I missed the reason?
[18:07:18] there's a severe lack of bug ticket updates on this ;)
[18:09:14] > i want to test puppet runs from clean state
[18:09:58] (quoting myself)
[18:13:47] mediawiki03 was built from scratch last week, so ...
[18:15:31] but i didn't do it myself and i don't think greg-g wants me breaking it
[18:15:45] i should get some lunch
[18:17:49] jeremyb: when you get back, can you write up your current thinking of what's going on on that original bug report? just so everyone can catch up and help? thanks.
[18:19:23] i didn't leave
[18:21:01] greg-g, bd808|LUNCH: sorry, missed some of the irc banter around mediawiki03. is the scap pool for beta labs something i can manage?
[18:21:29] jeremyb: wasn't sure if you needed to get out of a chair to get your lunch or not
[18:21:33] :)
[18:23:08] trebuchet style scap should be working
[18:23:09] jeremyb: request still stands then, just remove the first "when you get back, " part :)
[18:23:39] but beware, bits and upload and maybe other pieces probably don't go to security instance
[18:23:45] if you want to test those too...
[18:25:58] what all are you doing (reasking my request)? I don't see anything else here: https://gerrit.wikimedia.org/r/#/q/owner:%22Jeremyb+%253Cjeremy%2540tuxmachine.com%253E%22,n,z
[18:26:01] jeremyb: ^
[18:27:23] i tested a while ago https://git.wikimedia.org/commitdiff/operations%2Fpuppet.git/0e4d2e4bceb442ea039adbba2a7912287ce092c3
[18:27:27] but that didn't help
[18:27:47] jobrunner never got mediawiki codebase
[18:27:52] which it's supposed to have
[18:28:05] so i'm going to figure out with this new box how mediawiki gets to the new box
[18:30:48] Wikimedia / Continuous integration: CI browser test dashboard takes 100 seconds to appear on first load - https://bugzilla.wikimedia.org/70671 (spage) NEW p:Unprio s:normal a:None In the last few days, when I visit https://integration.wikimedia.org/ci/view/BrowserTests/ it redirects to htt...
[18:31:32] ok, puppet run is going. back with lunch
[18:31:43] greg-g: is that clear?
[18:31:57] maybe will bring irc with me though :)
[18:33:05] hah, that was wishful thinking
[18:39:47] marxarelli: The host needs to be added to https://github.com/wikimedia/operations-puppet/blob/production/modules/beta/files/dsh/group/mediawiki-installation
[18:40:31] marxarelli: So just do the normal gerrit patch and cherry-pick for it. mutante should be good for a merge
[18:42:00] jeremyb: Mediawiki gets to a box via scap. Specifically this Jenkins job -- https://integration.wikimedia.org/ci/job/beta-scap-eqiad/
[18:42:12] noooooo
[18:42:12] which is apparently totally borked
[18:42:27] bd808: got it
[18:42:37] https://integration.wikimedia.org/ci/job/beta-scap-eqiad/
[18:42:39] puppet has to set things up first before that jenkins job can work
[18:42:42] rsync target directory /srv/mediawiki not found
[18:42:47] right
[18:43:13] hrmmm
[18:43:22] although....
[18:43:34] you gave me an idea
[18:43:37] Or made symlinks that one looks to be missing
[18:43:56] *ori
[18:44:12] I think the gerrit patch said "not in puppet yet"
[18:45:39] !log Made /srv/mediawiki a symlink to /srv/common-local on deployment-jobrunner01
[18:45:41] Logged the message, Master
[18:48:20] doh
[18:48:23] is it that simple?
[18:48:34] are you puppeting or should i?
[18:49:21] I think Ori has been working on this. We should be getting rid of /src/common-local and provisioning /srv/mediawiki directly
[18:49:31] srv*
[18:49:47] yeah. sorry. typing is hard today
[18:49:56] i mean puppet run with existing puppet state
[18:50:06] not to patch in the missing bit
[18:50:23] and then when puppet's done we do jenkins build
[18:51:24] Something wonky is happening during scap on that host. Continuing to debug.
[18:53:35] !log Removed symlink and made /srv/mediawiki a proper directory on deployment-jobrunner01; Running sync-common to populate.
[18:53:36] Logged the message, Master
[18:53:57] jeremyb: So is puppet failing to run on that host too?
[19:00:36] grrr, i hate you puppet
[19:00:39] !log sync-common finished on deployment-jobrunner01; trying Jenkins scap job again
[19:00:41] Logged the message, Master
[19:00:46] now i have to rekey a bunch of hosts...
[19:03:12] !log w00t!!! scap job is green again -- https://integration.wikimedia.org/ci/job/beta-scap-eqiad/20965/
[19:03:14] Logged the message, Master
[19:03:28] woot
[19:04:31] !log Changed /usr/local/apache/common-local symlink to point to /srv/mediawiki on deployment-jobrunner01
[19:04:32] Logged the message, Master
[19:05:17] !log Deleted /srv/common-local on deployment-jobrunner01
[19:05:19] Logged the message, Master
[19:11:09] !log /var full on deployment-jobrunner01
[19:11:12] Logged the message, Master
[19:12:01] fscking small /var labs partitions
[19:14:18] bd808: [10:20:02] !log deployment-bastion /var at 97%, freed up ~500MB. apt-get clean && rm -rv /var/log/account/pacct*
[19:14:22] !log Deleted /var/log/mediawiki/jobrunner.log and restarted jobrunner on deployment-jobrunner01:
[19:14:24] Logged the message, Master
[19:15:35] jeremyb: Cool. What's /var/log/account/pacct* >
[19:15:39] *?
[19:15:47] process accounting
[19:15:57] afaik
[19:19:06] 412M of my tiny 1.2G avail :(
[19:19:06] !log Deleted /var/log/account/pacct* and /var/log/atop.log.* on deployment-jobrunner01 to make some temporary room in /var
[19:19:08] Logged the message, Master
[19:25:46] jeremyb: Umm is "Could not retrieve catalog from remote server: SSL_connect returned=1 errno=0 state=SSLv3 read server session ticket A: sslv3 alert certificate revoked" related to "< jeremyb> now i have to rekey a bunch of hosts..."?
[19:30:16] !log Removed old mw-job-runner cron job on deployment-jobrunner01
[19:30:30] Logged the message, Master
[19:32:09] !log Killed jobs-loop.sh tasks on deployment-jobrunner01
[19:32:11] Logged the message, Master
[19:32:54] bd808: yeah, doing it now
[19:33:31] bd808: bad docs or i'm tired or something. but i have a loop to fix :)
[19:37:45] !log Deleted old /srv/common-local on deployment-videoscaler01
[19:37:47] Logged the message, Master
[19:38:14] !log Made /usr/local/apache/common-local a symlink to /srv/mediawiki on deployment-bastion
[19:38:16] Logged the message, Master
[19:38:39] !log Fixed beta-recompile-math-texvc-eqiad job on deployment-bastion
[19:38:41] Logged the message, Master
[19:44:06] * bd808 retires from playing opsen for the day and goes back to coding SUL stuff
[19:54:28] before I go to my next call, are things all good jeremyb ?
[19:59:28] greg-g: errr, almost. should be 5 more mins
[19:59:55] for the certs
[20:00:05] i haven't looked at jenkins in a while
[20:00:36] jeremyb: can you do a write up of what happened/everything you fixed? the outage was big enough to warrant an incident report, but I don't have all of the information.
[20:01:20] greg-g: yeah...
[20:01:35] ty
[20:19:20] i can't log in to deployment-parsoidcache01.eqiad.wmflabs. it's been up since ~march and OOM killer was invoked at some point
[20:19:26] should i reboot it?
[20:19:30] can someone else log in?
[20:21:21] everything else should have puppet fixed but i'm double checking
[20:31:22] jeremyb: I get "Permission denied (publickey)" from deployment-parsoidcache01.eqiad.wmflabs which usually means that nfs has gone awol
[20:31:35] So I'd say just restart it via wikitech
[20:41:43] ok
[20:41:45] booting
[20:42:36] depending on the interface, it can be really careful about getting confirmation to boot or not
[20:45:02] marxarelli: https://www.mediawiki.org/w/index.php?title=Wikimedia_Release_Engineering_Team%2FVagrant_survey&diff=1147859&oldid=1138286
[20:50:30] Wikimedia Labs / deployment-prep (beta): Beta Cluster api.php, index.php, load.php return 404 (caused failed browser tests) - https://bugzilla.wikimedia.org/70648#c2 (Antoine "hashar" Musso) mw-api-siteinfo.py is in the repository integration/jenkins.git and should probably have better error handling w...
[20:51:06] greg-g: looks good. that "would you recommend" question will be our hard metric
[20:52:13] marxarelli: /me nods
[20:53:27] who is jeremyb, where is jeremyb, and does he ever sleep?
[20:54:15] sooon i need to
[20:54:23] volunteer opsen, east coast, and not enough information to give a firm answer
[20:54:32] hah
[20:54:43] greg-g: so, i'm going to write something on the train
[20:54:43] :)
[20:54:56] jeremyb: that'd be great, thank you
[20:55:00] So was the root cause of most of the brokenness /var filling up on multiple hosts?
[20:55:07] If so *RAGE*
[20:55:09] doubtful
[20:55:16] i didn't see anything full firsthand
[20:55:34] there was a puppet killed by oomkiller
[20:55:40] not sure if it was a master or not
[20:55:50] and a disk did fill up but not while i was watching
[20:56:15] jobrunner had a full /var but other problems that helped that along
[20:56:36] The half baked move to /srv/mediawiki was its biggest issue
[20:57:07] Sep 10 05:51:58 deployment-salt kernel: [8255978.329157] Out of memory: Kill process 8394 (puppet) score 371 or sacrifice child
[20:57:07] Sep 10 05:51:58 deployment-salt kernel: [8255978.330110] Killed process 8394 (puppet) total-vm:1647904kB, anon-rss:1487172kB, file-rss:2416kB
[20:57:32] That's not goodly
[20:57:34] and 8394 was in fact the puppetmaster
[21:06:19] deployment-sandbox.eqiad.wmflabs: Error: /bin/mkdir -p /var/cache/pbuilder/aptcache; cowbuilder --create --aptcache '/var/cache/pbuilder/aptcache' --buildplace '/var/cache/pbuilder/build' --distribution lucid --basepath /var/cache/pbuilder/base-lucid.cow --components 'main universe' --othermirror 'deb http://aError: Could not run Puppet configuration client: No space left on device -
[21:07:02] ok, bye :)
[21:53:21] CRIT: deployment-prep.deployment-videoscaler01.puppetagent.failed_events.value
[21:53:24] (puppet fail on beta on videoscaler)
[21:53:37] per new icinga checks
[22:04:58] Wikimedia / Continuous integration: CI browser test dashboard takes 100 seconds to appear on first load - https://bugzilla.wikimedia.org/70671 (Andre Klapper)
[23:12:40] (PS1) Damienkan: Add Rubocop files. [ruby/api] - https://gerrit.wikimedia.org/r/159630