[00:00:27] indeed it isn't
[00:17:00] is the job queue on beta labs stuck again?
[00:28:29] Wikimedia Labs / deployment-prep (beta): beta labs job queue stuck (again) - https://bugzilla.wikimedia.org/70374 (Kunal Mehta (Legoktm)) NEW p:Unprio s:major a:None http://en.wikipedia.beta.wmflabs.org/wiki/Special:GlobalRenameProgress/ZFilipin_%28WMF%29 had a bunch of jobs queued that wer...
[00:28:56] Wikimedia Labs / deployment-prep (beta): beta labs job queue stuck (again) - https://bugzilla.wikimedia.org/70374 (Kunal Mehta (Legoktm)) p:Unprio>High
[00:29:03] bd808: ^
[00:30:54] legoktm: blerg. I'll see if the jobrunner is dead
[00:31:42] ack! we are still running jobs-loop.sh in beta!
[00:35:18] !log deployment-prep Puppet run on deployment-jobrunner01 failing with what seem to be dns issues (getaddrinfo: Name or service not known)
[00:35:39] :<
[00:35:59] I thought Coren had a fix for dns issues?
[00:36:10] also the !log bot is down
[00:36:52] petan: Have time to fix the logging bot again?
[00:37:28] legoktm: He has fixed them in other places. I'm not sure what's up here. Just restarted nscd and it didn't seem to help
[00:37:49] ok
[00:41:58] andrewbogott: Got a minute to revive morebots? petan restarted it earlier today but it seems to be awol again.
[00:42:07] bd808: sure
[00:43:40] log deployment-prep Puppet run on deployment-jobrunner01 failing with what seem to be dns issues (getaddrinfo: Name or service not known when Trebuchet is running)
[00:43:46] !log deployment-prep Puppet run on deployment-jobrunner01 failing with what seem to be dns issues (getaddrinfo: Name or service not known when Trebuchet is running)
[00:44:16] * bd808 saw nick and didn't notice direction
[00:44:26] !log deployment-prep Puppet run on deployment-jobrunner01 failing with what seem to be dns issues (getaddrinfo: Name or service not known when Trebuchet is running)
[00:44:39] !log
[00:45:10] labs-morebots, you ok?
[00:45:10] I am a logbot running on tools-exec-08.
[00:45:10] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[00:45:10] To log a message, type !log .
[00:45:16] y'know, it's trying to talk to wikitech.
[00:45:31] probably we broke its connection somehow
[00:45:50] legoktm: I don't have a fix for dns issues; I did something that was a plausible improvement.
[00:46:09] oh, ok
[00:47:31] legoktm: But also, I don't know whether that puppet error /is/ broken dns either. :-)
[00:47:47] Coren: Still trying to figure that out myself
[00:48:31] Coren: But on the plus side your bandaid has helped the beta scap job measurably. No failures for dns since you made the change
[00:49:22] bd808: Do one thing for me though; on the box where the error occurred, can you do a 'sudo nscd -g' and see if the 'hosts cache' has both 'yes cache is enabled' and a positive number of 'cache hits on positive entries'?
[00:50:30] hosts cache: 3545 cache hits on positive entries
[00:50:44] Good, so that works at least.
[00:50:58] I think this is a salt/trebuchet issue
[00:51:11] and possibly wikitech related but not sure of that yet
[00:55:58] !log testlabs test this is testing a test
[00:59:51] bd808: logging is working properly, it's just the ack that fails. the bot is confused by the API response and panics.
[01:00:07] legoktm: That said, I'm pretty confident that the use of nscd will have lightened the load enough that poor little crappy dnsmasq will fare better.
[01:00:28] :D
[01:05:19] YuviPanda: Re https://gerrit.wikimedia.org/r/#/c/158111/, what are the blockers for doing this for Tools as well? The flow is $INSTANCE => labmon/graphite => icinga.wikimedia.org or icinga.wmflabs.org? For example, tools-exec-12 has a Puppet error (Trusty and JDK, IIRC). Who would have the power to press the "Is being handled" button on which UI to silence that for the moment?
[01:06:00] scfc_de: no blockers except the checks aren't showing up in icinga yet, and I don't have access to neon so I need to grab someone with access and debug
[01:06:17] scfc_de: as for who would have the power, uh, gooooood point. I guess we could get you added to the nda group that gets icinga access?
[01:06:20] you probably already
[01:06:21] are
[01:06:27] can you see icinga.wikimedia.org
[01:08:17] someone on morebots tried apt-get install mwclient?
[01:08:49] YuviPanda: Nope, I need to ask mutante for that still :-) (or finally finish that damn bug). So beta hasn't active alarms yet, just the scaffolding?
[01:09:05] scfc_de: it theoretically should have active alarms, but does not
[01:09:10] and I need to investigate to find out why
[01:09:42] I should go now
[01:09:42] cya
[01:11:30] YuviPanda: Bye.
[01:23:28] !log deployment-prep Started jobrunner service manually on jobrunner01.
[01:24:15] !log deployment-prep Many jobrunner errors like "wikiversions-labs.cdb has no version entry for `amwiki`" with various wiki names
[01:27:16] legoktm: Can you check the beta job queue depth (or teach me how to) and see if it is going down?
[01:27:41] bd808: I just use mwscript showJobs.php --wiki=enwiki --group
[01:27:45] Flow\Jobs\WatchTitle: 0 queued; 1422 claimed (1422 active, 0 abandoned); 0 delayed
[01:27:49] so, looks much better
[01:28:08] (all the other job types are gone)
[01:30:22] I wonder if those flow jobs are what's making all the error log output for the jobrunner? Looks like everything else is processed now.
[01:31:58] :/
[01:32:02] pastebin an error log sample?
[01:32:11] or where should I look?
[01:32:27] Wikimedia Labs / deployment-prep (beta): beta labs job queue stuck (again) - https://bugzilla.wikimedia.org/70374#c1 (Bryan Davis) Manually started jobrunner service on deployment-jobrunner01. $ mwscript showJobs.php --wiki=enwiki --group Flow\Jobs\WatchTitle: 0 queued; 1422 claimed (1422 active, 0 a...
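The per-type lines printed by `mwscript showJobs.php --wiki=enwiki --group` (quoted at 01:27:45) are regular enough to parse. The following is a minimal sketch, assuming every line has exactly the shape seen in this log; the function name `parse_show_jobs` is made up for illustration and is not part of MediaWiki.

```python
import re

# One line of `showJobs.php --group` output as quoted in the log, e.g.
# Flow\Jobs\WatchTitle: 0 queued; 1422 claimed (1422 active, 0 abandoned); 0 delayed
# Assumption: all lines follow exactly this shape.
LINE_RE = re.compile(
    r"^(?P<type>\S+): (?P<queued>\d+) queued; (?P<claimed>\d+) claimed "
    r"\((?P<active>\d+) active, (?P<abandoned>\d+) abandoned\); "
    r"(?P<delayed>\d+) delayed$"
)


def parse_show_jobs(text):
    """Return {job_type: {counter_name: int}} for every line that matches."""
    stats = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip PHP notices and anything else that is not a counter line
        fields = match.groupdict()
        job_type = fields.pop("type")
        stats[job_type] = {name: int(value) for name, value in fields.items()}
    return stats
```

Comparing the `claimed`, `active` and `abandoned` counters between two runs would show whether a job type is draining or merely sitting in the claimed state.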
[01:32:42] legoktm: /var/log/mediawiki/jobrunner.log on deployment-jobrunner01
[01:33:02] The wikis it's complaining about are not beta wikis
[01:33:14] Not sure where the weird jobs come from
[01:33:32] ohhhhhh
[01:33:33] that was me
[01:33:41] jerk :)
[01:33:52] I did sudo -u apache foreachwiki runJobs.php --type=LocalRenameUserJob earlier
[01:34:12] so I could clear out the global rename rather than doing it manually across 15 wikis
[01:34:23] ah
[01:34:26] and I guess foreachwiki runs across wikis that don't exist on beta
[01:34:38] it runs on all.dblist
[01:34:55] which probably doesn't have any beta wikis in it
[01:35:05] * bd808 looks
[01:35:46] I bet it doesn't know about realm specific files
[01:36:30] Hmmm it calls getRealmSpecificFilename
[01:36:32] but it keeps spamming the logs?
[01:37:07] I tail -f 'd the log file and new stuff about nlwiki keep showing up
[01:37:28] yeah it was a continuous loop of errors
[01:40:26] legoktm: foreachwiki does the right stuff so you didn't do it I don't think
[01:40:34] oh
[01:41:21] !log deployment-prep Killed old jobs-loop.sh processes on deployment-jobrunner01
[01:44:12] Wikimedia Labs / deployment-prep (beta): beta labs job queue stuck (again) - https://bugzilla.wikimedia.org/70374#c2 (Bryan Davis) $ tail -5000 /var/log/mediawiki/jobrunner.log|grep 'Fatal error'|grep 'has no version entry'|awk '{print $9}'|sort|uniq -c 52 `afwikibooks`. 51 `afwikiquote`....
[01:47:44] Flow jobs still appear stuck.
[05:14:26] !log deployment-prep Bad jobs in job queue filled up /var on jobrunner01 and killed jobrunner script. Leaving down for now until I find out how to delete the bad jobs.
[05:14:57] open up redis-cli and "flushdb"? :P
[05:16:11] bd808: are the jobs claimed?
[05:16:17] there's a JobQueue::delete()
[05:17:12] legoktm: All I know bout jobserver and jobs is they break sometimes :) Please feel empowered to fix if you know how.
[05:17:30] You can blame me if things go badly
[05:17:42] "Bryan said I should!"
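The `tail | grep | awk | sort | uniq -c` pipeline from the 01:44:12 bug comment can also be written in Python. This is a rough sketch, not the command that was actually run: it matches the backtick-quoted wiki name with a regular expression instead of relying on field `$9`, and the helper names `count_missing_wikis` and `tail_lines` are made up for illustration.

```python
import re
from collections import Counter, deque

# Matches the wiki name in messages such as
# "wikiversions-labs.cdb has no version entry for `amwiki`"
ENTRY_RE = re.compile(r"has no version entry for `([^`]+)`")


def count_missing_wikis(lines):
    """Count 'has no version entry' fatal errors per wiki name."""
    counts = Counter()
    for line in lines:
        if "Fatal error" not in line:
            continue  # same filter as the first grep in the pipeline
        match = ENTRY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def tail_lines(path, n=5000):
    """Return the last n lines of a file, like `tail -5000`."""
    with open(path, errors="replace") as handle:
        return list(deque(handle, maxlen=n))
```

On the jobrunner host this would be called as `count_missing_wikis(tail_lines("/var/log/mediawiki/jobrunner.log"))`, and `.most_common()` on the result gives the same kind of per-wiki tally as `sort | uniq -c`.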
[05:17:55] off topic: PHP Notice: Undefined index: wmgExtraLanguageNames in /mnt/srv/scap-stage-dir/php-master/includes/SiteConfiguration.php on line 305
[05:18:07] bd808: which are the bad jobs?
[05:18:23] legoktm, where?
[05:18:29] yeah. I haven't figured out why that var is not being seen. It seems to be set in the config
[05:18:30] oh, beta?
[05:18:46] jeremyb: yeah beta
[05:18:50] yeah, when running mwscript showJobs.php --wiki=enwiki --group
[05:18:51] I've seen it a few times when using eval.php
[05:18:54] never on prod though
[05:19:26] legoktm: Whatever jobs are left ~3 minutes after starting the jobserver again on deployment-jobrunner01 are the bad ones I'd say
[05:20:02] is the runner down still?
[05:20:25] Yeah I didn't start it back up. `sudo service jobrunner start` will start it again
[05:20:45] and it will fill the disk and die again if those jobs aren't purged somehow
[05:20:45] someone want to log something? :)
[05:21:22] !log morebots ping
[05:21:23] morebots is not a valid project.
[05:21:29] grrr
[05:21:33] !log tools morebots ping
[05:21:36] Logged the message, Master
[05:21:38] I think it's logging but not acking here
[05:21:49] orly? :-)
[05:21:55] Something about the wikitech changes we deployed today
[05:22:08] Andrew looked at it earlier
[05:22:36] !log deployment-prep restarted jobrunner on deployment-jobrunner01
[05:22:38] Logged the message, Master
[05:22:43] yeah http://dpaste.com/078DJ46.txt
[05:22:59] lol, that's not valid JSON xP
[05:23:36] bd808: so, looks like the flow jobs
[05:23:40] cirrusSearchLinksUpdateSecondary: 0 queued; 0 claimed (0 active, 0 abandoned); 13 delayed
[05:23:40] Flow\Jobs\WatchTitle: 0 queued; 1422 claimed (0 active, 1422 abandoned); 0 delayed
[05:23:47] legoktm: yup
[05:23:47] delayed jobs are well, delayed
[05:24:10] The flow ones were what was left earlier
[05:24:39] proof of morebots doing its job -- https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:Deployment-prep/SAL&diff=125721&oldid=125719
[05:24:59] legoktm, bd808: here's the fix: http://dpaste.com/3RX66W5
[05:25:21] evil, but works
[05:25:28] should probably fix lucene
[05:25:35] heh. we should probably fix wikitech tomorrow
[05:25:39] I'm guessing it's a version incompatibility somewhere
[05:26:25] andrewbogott_afk: jeremyb looked up the source of the morebots issue, something to fix in wikitech config -- http://dpaste.com/078DJ46.txt
[05:26:36] * bd808 really leaves for reals now
[05:28:14] JobQueue::delete() says Deleted all unclaimed and delayed jobs from the queue
[05:28:18] except all the flow jobs are claimed
[05:28:52] !log deployment-prep stopping jobrunner on deployment-jobrunner01
[05:28:55] Logged the message, Master
[07:07:53] Hi everybody! Quick newbie question. I'd like to run an SQL command on the wikidatawiki.labsdb database, to get a list of all the links from Wikidata to Commons. What sort of account and permissions do I need?
[07:08:26] you can use http://quarry.wmflabs.org/
[07:09:11] Jheald: http://quarry.wmflabs.org/query/439 for example
[07:10:11] Thanks! That's exactly what I need.
[07:55:23] ssh jzerebecki-test.eqiad.wmflabs now gives me Permission denied (publickey). whereas it worked fine yesterday. Any ideas how to fix it?
[08:07:31] legoktm: my query ran (for 25 minutes), but then it's given me nothing back http://quarry.wmflabs.org/query/441 though it's fine with a limit http://quarry.wmflabs.org/query/443 Okay, so I was expecting 430,000 rows, which is quite a lot for the quarry front-end to cope with; but all the same, is there any way I can get to the output of the query?
[08:09:17] on a new labs instance puppet run i get: mounting labstore.svc.eqiad.wmnet:/project/puppet-cleanup/home failed, reason given by server: No such file or directory
[08:31:09] Jheald_afk, https://tools.wmflabs.org/jeremyb/wb_items_per_site.1409819129.bz2
[08:44:21] jeremyb: Thanks, it's great to have the whole list. But what I also needed was the corresponding Q-numbers, because I would like to compare it with the output of Magnus's WDQ tool, so that ultimately I can find out how many article-like Q numbers have sitelinks to categories on Commons, etc
[08:44:43] so, only ns0 then?
[08:45:19] ns0 ?
[08:46:36] main namespace
[08:46:50] thanks Nemo
[08:47:01] https://www.mediawiki.org/wiki/Help:Namespaces
[08:47:43] Yes, I suppose so
[08:47:58] Yes
[08:50:39] aka mainspace
[08:50:44] And I guess I need a join to the page title to get the Q-number, because I may have been wrong to think that ips_item_id would necessarily correspond to the Q-number
[08:51:52] YuviPanda, cc0 is not a license
[08:51:53] :P
[08:55:16] Jheald, what's an example where it didn't match?
[08:55:27] -- though I was basing that assumption on this query by Multichill https://tools.wmflabs.org/multichill/queries/wikidata/cross_namespace.sql
[08:56:27] jeremyb: no example, I was just following Multichill's query without thinking about it one way or the other
[08:57:52] so assuming that all I needed was ips_item_id
[08:59:13] (PS1) JanZerebecki: Add empty files necessary for puppet class icinga::monitor. [labs/private] - https://gerrit.wikimedia.org/r/158340
[09:00:02] Jheald, https://tools.wmflabs.org/jeremyb/wb_items_per_site.1409821121.bz2 ?
[09:00:11] really don't know what you're looking for
[09:02:53] just checking
[09:07:03] jeremyb: thank you so much. Could you give me both those two columns together -- there's currently a mismatch of one line between them somewhere, presumably where somebody's created an item between the queries being run
[09:24:07] Jheald, you should get your own account :)
[09:25:22] well, my first question here was to ask what sort of account I needed :)
[09:25:51] Jheald, ok, look again
[09:25:59] you don't need a special account
[09:26:04] just access to tool lab s
[09:26:08] "tool labs"
[09:26:30] okay, so I have a labs account, I need a "tool labs" account
[09:26:50] then I can just write an sql query ?
[09:26:56] yes
[09:27:49] thanks
[09:33:58] (PS1) JanZerebecki: Add labs ssl key for puppet icinga::monitor. [labs/private] - https://gerrit.wikimedia.org/r/158344
[09:46:48] Jheald_away, did i forget to drop the link to the new one? anyway, same dir
[09:46:52] sorry
[09:49:43] (PS1) Lokal Profil: Make table name correspond to country name [labs/tools/heritage] - https://gerrit.wikimedia.org/r/158346
[10:05:20] (CR) Lokal Profil: "This should also be behind ErfgoedBot not triggering Unused images etc. on se-arbetsliv" [labs/tools/heritage] - https://gerrit.wikimedia.org/r/158346 (owner: Lokal Profil)
[10:28:52] er, I'm getting 401 errors trying to access the image itself on https://wikitech.wikimedia.org/wiki/File:Puppetquery.png
[10:44:48] meh
[10:44:51] I am having problems too
[10:45:18] @notify andrewbogott_afk
[10:45:18] This user is now online in #wikimedia-tech. I'll let you know when they show some activity (talk, etc.)
[12:44:35] with the wikitech changeover, is it known that some (all?) images are not displaying?
[13:28:48] shouldn't valhallasw's flask_mwoauth be installed on labs? or must be installed userwide?
[13:54:25] <_joe_> !log deployment-prep stopped puppet on the appservers but mw03, testing an apache change
[13:54:27] Logged the message, Master
[15:30:13] !log bots add user billinghurst to wm-bot/configuration/admins
[15:30:16] Logged the message, Master
[15:42:32] !log bots add user billinghurst to wm-bot/configuration/admins, fix case
[15:42:34] Logged the message, Master
[15:53:55] sDrewth: There are still a number of kinks being actively worked on.
[15:54:28] Coren: that sums up WMF :-)
[15:54:36] and probably half the staff
[15:54:42] Pthbpbpt. :-P
[15:55:19] at least there is hope for you lot
[15:55:27] Wikimedia Labs / deployment-prep (beta): beta labs job queue stuck (again) - https://bugzilla.wikimedia.org/70374#c3 (Bryan Davis) Tried to clean up bad jobs manually: redis 127.0.0.1:6379> keys *:jobqueue:LocalRenameUserJob:l-* 1) "nlwikiquote:jobqueue:LocalRenameUserJob:l-unclaimed" 2) "amwikiquot...
[16:06:12] Wikimedia Labs / deployment-prep (beta): beta labs job queue stuck (again) - https://bugzilla.wikimedia.org/70374#c4 (Bryan Davis) NEW>RESO/FIX W00t figured it out. There is a special hash for the new jobrunner that tracks what queues to try and process: redis 127.0.0.1:6379> hkeys "jobqueue:agg...
[16:06:42] !log deployment-prep Manually cleaned bogus LocalRenameUserJob jobs from redis
[16:06:44] Logged the message, Master
[16:45:57] Wikimedia Labs / Infrastructure: Internal DNS look-ups fail every once in a while - https://bugzilla.wikimedia.org/70076#c9 (Tim Landscheidt) ASSI>RESO/FIX I haven't seen any errors since Tuesday morning, so the change of the nscd configuration seems to have fixed the issue.
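The redis keys listed in the 15:55:27 bug comment have the shape `<wiki>:jobqueue:<job type>:<suffix>`, so the bogus ones can be picked out by checking the wiki prefix against the wikis that really exist on beta. The sketch below only selects keys from a list; it does not talk to redis or delete anything, and it is not the cleanup that was actually performed. The name `bogus_keys` and the `valid_wikis` set are assumptions for illustration. The actual fix also involved the aggregator hash whose name is truncated in the log, which is not covered here.

```python
def bogus_keys(keys, valid_wikis, job_type="LocalRenameUserJob"):
    """Return job queue keys of the given type whose wiki is not in valid_wikis.

    Assumes keys shaped like '<wiki>:jobqueue:<job type>:<suffix>', as in
    'nlwikiquote:jobqueue:LocalRenameUserJob:l-unclaimed' from the log.
    """
    selected = []
    for key in keys:
        parts = key.split(":")
        if len(parts) < 4:
            continue  # not a per-wiki job queue key
        wiki, namespace, key_type = parts[0], parts[1], parts[2]
        if namespace == "jobqueue" and key_type == job_type and wiki not in valid_wikis:
            selected.append(key)
    return selected
```

Reviewing the returned list by hand before removing anything is the safer route, since a wrong `valid_wikis` set would flag legitimate queues.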