[00:58:32] Krenair: indeed. done.
[00:58:54] Coren|2: the client processes were hung on mount
[01:28:18] !logs
[01:28:18] logs http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs
[09:51:50] mutante-away paravoid - https://bugzilla.wikimedia.org/show_bug.cgi?id=46356
[09:51:53] Ryan_Lane ^
[11:40:35] !log deployment-prep Resetting the extensions checkout. Been broken for a few days because of extension renaming.
[11:40:38] Logged the message, Master
[11:44:55] !log deployment-prep /home/wikipedia/common/php-master/extensions : git remote update && git reset --hard origin/master && git submodule update --init
[11:44:57] Logged the message, Master
[11:46:46] !log deployment-prep removing local hack made to ArticleFeedback data/maintenance/DataModelPurgeCache.php
[11:46:48] Logged the message, Master
[12:57:17] Good morning, Labs!
[12:59:10] hellooo
[12:59:23] hi hashar
[12:59:31] (bis)
[13:06:03] * Coren waves.
[13:17:01] * Silke_WMDE waves
[13:45:44] hashar: petan; Ryan_Lane: Any known issues with labs atm?
[13:45:50] Looks like /data is broken again
[13:45:55] Krinkle where?
[13:46:00] $ ll
[13:46:00] ls: cannot access project: Transport endpoint is not connected
[13:46:00] total 4.0K
[13:46:01] drwxr-xr-x 3 root root 0 Mar 14 00:45 ./
[13:46:03] drwxr-xr-x 25 root root 4.0K Mar 19 06:38 ../
[13:46:05] d????????? ? ? ? ? ? project/
[13:46:07] cvn-app1
[13:46:17] Krinkle which project
[13:46:20] cvn
[13:46:22] it works for me on bots
[13:46:29] maybe try remounting gluster there?
[13:46:39] @labs-project-instances cvn
[13:46:39] Following instances are in this project: cvn-apache2, cvn-app1,
[13:46:46] ok can you try on apache2?
[13:46:49] if it works
[13:46:52] It works there
[13:47:01] ok then maybe you need to reboot app1?
[13:47:08] or at least restart gluster
[13:47:24] How? And isn't there something that ensures this automatically?
[13:47:29] or maybe it's just yet another brain split (tm)
[13:47:34] I don't want to have to monitor cluster on every instance
[13:47:41] gluster*
[13:47:46] unfortunately gluster sucks :/
[13:47:55] I am having the same troubles and I am just as pissed as you are
[13:48:19] So how do I remount or restart gluster?
[13:48:52] https://wikitech.wikimedia.org/w/index.php?title=Special:Search&search=gluster&fulltext=Search&profile=all&redirs=1
[13:49:11] you need to restart the automount service I think (you should switch to root instead of just using sudo, because it may cause some sudo outage); there is some gluster service in /etc/init.d
[13:49:35] not sure but I think that it's so bugged that last time I wanted to restart it I had to reboot
[13:49:45] because just restarting didn't help
[14:00:31] Krinkle: I just reboot the instance
[14:00:47] Krinkle: if it is broken on every instance, that might be an issue on the Gluster server
[14:01:04] Krinkle: on your local instance you can look at /var/log/glusterfs/data-project.log*
[14:02:47] Ain't nobody got time for that ;-)
[14:03:08] I rebooted cvn-app1 already
[14:03:11] it's working again
[14:04:45] !log cvn gluster had another hiccup, /data became inaccessible ("Transport endpoint is not connected"). Bots down for almost 24 hours. Rebooting the instance fixed it.
[14:04:47] Logged the message, Master
[14:05:12] If this keeps messing up instances beyond automatic recovery I'll have to move back to toolserver.
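Putting the suggestions above together — petan's "restart the automount service" and hashar's pointer at the client log — a minimal sketch of checking and remounting a dead Gluster mount on an instance; the autofs service name is an assumption, and a reboot remains the fallback that actually worked here:

    # Check whether the /data/project mount is dead ("Transport endpoint is not connected"):
    ls /data/project

    # Inspect the Gluster client log hashar mentions:
    sudo tail -n 50 /var/log/glusterfs/data-project.log

    # Try a clean remount before falling back to a reboot (assumes autofs manages the mount):
    sudo umount -l /data/project     # lazily detach the stale FUSE mount
    sudo service autofs restart      # restart the automounter
    ls /data/project                 # touching the path should trigger a remount

    # If it is still broken, reboot the instance, as Krinkle did for cvn-app1:
    # sudo reboot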
[14:06:21] Krinkle: tool server is going to die :-]
[14:06:32] if you don't need /data/project you can use /mnt (mount of /dev/vdb)
[14:06:32] At least it has a working HARD DISK
[14:06:41] another possibility is to export /mnt
[14:06:47] using NFS mount on the other instance
[14:06:48] We need the data shared across instances
[14:06:58] I'm not going to spend another minute on it.
[14:07:09] I can do lots of things
[14:07:09] that is most probably what I am going to do on beta because Gluster keeps ruining my productivity
[14:07:09] I could also set up AWS and host it there.
[14:07:34] one sure thing: tool server is going to die
[14:07:54] Yes, we all know that. But not until labs is no longer incomplete and unstable.
[14:16:26] !log deployment-prep Attempting to fix central notice database schema for enwiki
[14:16:28] Logged the message, Master
[14:26:48] !log deployment-prep getting lazy, dropping Central Notice tables from enwiki and rerunning updater.
[14:26:50] Logged the message, Master
[14:55:27] Krinkle: Gluster is on its way out.
[14:55:37] (Thankfully)
[15:01:15] !log deployment-prep Updated database enwiki
[15:01:17] Logged the message, Master
[15:52:51] !log wikidata-dev wikidata-dev-9 deleted old temp. test-coverage files
[15:52:53] Logged the message, Master
[16:00:47] andrewbogott_afk: I played with the puppet modification again. I don't know how to make Wikidata's init.pp notice mediawiki's mw-extension.pp file.
[16:03:26] Silke_WMDE: I can take a look. Are you including the module?
[16:03:34] yes
[16:04:06] I get nothing but the Nuke, SpamBlacklist and ConfirmEdit extensions
[16:07:02] OK, lemme set up a test box and see what I can see.
[16:09:20] andrewbogott: If you want you can come onto mine
[16:09:37] thanks… I think I want to start fresh
[16:09:43] ok
[16:09:48] If that turns out to be hard I might rethink :)
[16:10:07] So that I can copy your settings, which instance are you working on atm?
[16:10:46] wikidata-repotest and wikidata-client-test
[16:11:54] And you're seeing the same extension behavior on both?
[16:12:55] yes
[16:13:50] 'k
[16:19:22] well, and now my homedir on bastion is read only… am I late to that particular party?
[16:20:15] heh
[16:20:27] always welcome
[16:20:43] Means I can't use a new instance because it can't write to .ssh :(
[16:20:55] ouch
[16:21:59] Only I'll have to leave in a moment. Could you drop me an e-mail with the results of your inquiry?
[16:24:08] yep
[16:24:30] thanks and CU soon!
[16:49:58] Ryan_Lane: Amusing buglet. If you remove a puppet class from a project config, instances already having that class keep it but you can't see or remove it anymore.
[16:54:30] Coren: Do you use bastion-restricted or bastion1?
[16:54:59] I'm not cool enough for bastion-restricted. :-)
[16:55:26] Is your homedir on bastion1 r/w?
[16:55:36] I mean, in general it is, but is it right this minute?
[16:56:11] * Coren checks
[16:56:31] touch: cannot touch `foo': Read-only file system
[16:56:49] curious
[17:04:55] is gluster in trouble? our shared gluster partition is read-only, even after a reboot
[17:05:36] gwicke: I'm investigating but don't know what's up yet.
[17:06:37] gwicke: your home dirs? Or something else?
[17:06:48] gluster isn't /in/ trouble. gluster /is/ trouble.
[17:09:10] andrewbogott: /data/project in the visualeditor group
[17:09:29] gwicke: OK. I'm going to restart your volume, and then ask you to reboot again...
[17:09:31] (but, give me a minute)
[17:09:44] ok
[17:12:02] gwicke: OK, can you reboot and tell me if it's better?
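The server-side step andrewbogott describes here (restart the project's volume, then have the client reboot to remount) would look roughly like this; the volume name follows the <project>-project naming seen later in the log and is an assumption:

    # On the Gluster server:
    gluster volume status visualeditor-project   # check which bricks are online
    gluster volume stop visualeditor-project
    gluster volume start visualeditor-project
    # ...then reboot (or remount on) the affected instance so the client reconnects.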
[17:12:17] will do
[17:16:24] <^demon> Can I get a new project please?
[17:16:33] andrewbogott: it is still read-only
[17:16:38] dammit
[17:16:50] * Coren says some really unkind things about webserver.pp
[17:16:58] ^demon: Name?
[17:17:11] <^demon> solr?
[17:18:07] what's your name on wikitech?
[17:19:54] ^demon: Presuming your wikitech name is 'demon', the project is created. We are having some filesystem issues though so… no guarantees that you'll be able to actually do anything :(
[17:20:22] <^demon> Ok, thanks.
[17:26:10] <^demon> Hmm, can't create instances, but can on my other project.
[17:30:05] gwicke: Are homedirs in that project working properly?
[17:30:43] andrewbogott: yes, my home is r/w
[17:30:53] and properly shared among other instances in the project?
[17:30:59] only the /data mount is broken as far as I can tell
[17:31:28] andrewbogott: I think so, as it has the usual non-local contents in it
[17:31:45] ok, sounds right.
[17:35:12] Oh, Ryan_Lane, you're here! I just emailed.
[17:35:21] yep. here
[17:35:52] So… we have some gluster volumes that are suddenly read-only.
[17:36:07] e.g. visualeditor-project
[17:36:24] I restarted the volume and gwicke restarted the instance to force a remount. But, no joy.
[17:36:43] The homedir on bastion has the same issue, although I am reluctant to reboot bastion until we have an actual fix.
[17:36:48] it seems the clients are hung
[17:36:55] bastion has the issue?
[17:36:57] really?
[17:36:58] hm
[17:37:34] which bastion?
[17:37:49] bastion-restricted, and Coren reports the same on bastion1
[17:38:02] I just checked the home directories on all of them
[17:38:07] Yup. filesystem readonly.
[17:38:14] oh. readonly
[17:38:28] Mount and gluster agree that the volumes are r/w
[17:38:31] but touch does not
[17:38:31] Coren: so this isn't the mounting issue?
[17:38:59] Ryan_Lane: Different one this time. tools- is currently not in trouble though.
[17:39:10] marc@bastion1:~$ touch me
[17:39:10] touch: cannot touch `me': Read-only file system
[17:39:30] indeed
[17:42:46] * Coren eyes the unused bricks with longing and mumbles something about NFS.
[17:48:27] <^demon> Ryan_Lane: So, you merged the change but never built my package friday. This is ok, I'm going to revert it anyway...can't deploy just yet.
[17:48:52] heh
[17:49:23] * Ryan_Lane sighs
[17:49:27] fucking gluster
[17:49:48] Do you see what's wrong?
[17:49:49] I stopped bastion-project and now gluster won't respond at all
[17:49:53] no
[17:50:09] Oh… gluster is always nonresponsive for 30 seconds or so after a volume start or stop
[17:50:23] this was way more than 30 seconds :)
[17:50:28] hm
[17:50:38] seems it can't connect to any daemons
[17:51:03] going to restart glusterd on all of them
[17:53:04] We don't have quorum turned on do we?
[17:54:10] Because that would explain the switch to read-only
[17:55:17] Ryan_Lane: Folks in the gluster channel suggest that this is proper quorum behavior if a brick dies. Which, it does look like quorum is turned on
[17:55:19] andrewbogott: yes, we do
[17:55:33] no brick died, though
[17:56:02] Well, could it have been just a brief hiccup? Wouldn't it stay read-only once it switched over?
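For context, a small sketch of how the quorum question being discussed could be checked on the storage servers; the volume name is an example from the log, and cluster.quorum-type only appears in the volume info output if it has been set explicitly:

    # On one of the labstore servers:
    gluster volume info bastion-home | grep -i quorum    # cluster.quorum-type, if set
    gluster volume status bastion-home                   # brick state; a hung brick shows N/A for its port
    grep -i quorum /var/log/glusterfs/bricks/*.log       # e.g. "failing create due to lack of quorum"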
[17:56:21] gluster volume status does show a brick as N/A for port, but shows it as alive, though
[17:56:24] which is absurd
[17:56:28] I hate this filesystem
[17:57:26] root@labstore1:/var/log/glusterfs/bricks# gluster volume status bastion-home
[17:57:26] Volume bastion-home is not started
[17:57:26] root@labstore1:/var/log/glusterfs/bricks# gluster volume start bastion-home
[17:57:26] Volume bastion-home already started
[17:57:31] -_-
[17:57:55] root@labstore1:/var/log/glusterfs/bricks# gluster volume stop bastion-home
[17:57:55] Stopping volume will make its data inaccessible. Do you want to continue? (y/n) y
[17:57:55] Volume bastion-home is not in the started state
[17:58:00] what a giant piece of shit
[17:59:01] failing create due to lack of quorum
[17:59:10] of course, it doesn't say which brick it can't connect to
[18:34:44] Krinkle: Gluster is on its way out.
[18:34:45] ETA?
[18:35:31] That is, ETA to not having instances randomly lose /data with "Transport endpoint is not connected" requiring a manual reboot.
[18:38:44] Krinkle: it's being actively worked on
[18:39:00] replacing a filesystem isn't exactly an easy task
[18:47:27] Ryan_Lane: No, it's actually a very /easy/ task. Doing it so that it is an improvement is the tricky part. :-)
[18:48:44] how often does code get automatically updated on beta labs? also - do config changes get automatically deployed on betalabs when merged?
[18:51:00] Coren: heh
[18:51:07] awjr: every 3 minutes, I believe
[18:51:12] hashar would know best
[18:51:30] thanks Ryan_Lane - yeah i know, im keeping an eye out hoping he'll sign on soon :)
[18:51:46] Ryan_Lane: do you know if config changes get auto-deployed too?
[18:53:01] awjr: no clue
[18:53:03] sorry
[18:53:44] I have a feeling some of the glusterfsd processes are hung
[18:54:45] ohho
[18:54:47] ok thanks Ryan_Lane
[18:54:52] yw
[19:01:31] Coren: for bastion it seems that a glusterfsd process was deadlocked
[19:01:50] I killed that one process, then force started the volume
[19:01:53] now bastion works
[19:02:15] Huh. Did that also fix the readonlies on the other instances that had the problem?
[19:02:32] any instance in bastion
[19:03:56] seems project data also has this issue
[19:04:08] I bet a whole shit-load of glusterfsd processes are deadlocked
[19:04:18] fun times
[19:05:30] win! I can write to files again on bastion-restricted!
[19:05:48] it seems like labstore1 may be the problem here
[19:06:08] Things are deadlocking on labstore1?
[19:06:34] glusterfsd processes are deadlocking, I'm not totally sure it's labstore1 yet
[19:07:26] Well, if there's more than one that got into that wedged state in the same period, chances that it's not caused by the server seem... improbable.
[19:07:34] looks like 3 and 4 aren't responding for project data
[19:07:44] Ryan_Lane, overall this seems like a win, right? Better read-only than splitbrain.
[19:07:50] yes
[19:07:58] this would have been massive corruption otherwise
[19:08:27] and it would have been silent corruption for quite a while before it was noticed
[19:08:34] it's still bullshit
[19:08:54] The same way having one broken leg is better than having two broken legs. It's strictly better, but still undesirable.
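What Ryan describes just above — killing the wedged glusterfsd and force-starting the volume — might look roughly like this on the affected server. This is a reconstruction from the conversation, not the literal commands used; the volume name is taken from the log and the PID is a placeholder:

    # On the affected server (labstore1 in this case):
    gluster volume status bastion-home            # lists bricks with their PIDs; the wedged one shows no port
    ps aux | grep '[g]lusterfsd' | grep bastion-home
    kill -9 <pid>                                 # placeholder: PID of the deadlocked glusterfsd
    gluster volume start bastion-home force       # respawns missing brick processes without stopping the volume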
[19:09:21] OK, so, not a win, but a smaller loss
[19:09:55] so, I killed the process on labstore1 for project data and it started working too
[19:10:01] I want to find another broken project
[19:10:09] so that I can test killing a process on another node
[19:10:15] <^demon> Soo, I can't create instances on one project, but can on another. Permissions?
[19:10:19] just to ensure it's not a fluke
[19:10:26] ^demon: which projects?
[19:10:29] visualeditor was also read only
[19:10:33] <^demon> I can on gerrit, can't on solr.
[19:10:34] and are you a projectadmin on both?
[19:10:41] um… the project dir, not the homedir
[19:10:44] <^demon> Dunno about the latter, it was just created for me today :)
[19:18:37] so, I think I want to reduce the brick count from 4 to 2
[19:20:09] killing the glusterfsd process on labstore3 didn't fix the read-only status for visualeditor
[19:20:21] now to try it on labstore1...
[19:21:02] if we drop the brick count, it'll drop the filesystem from 62 to 31TB
[19:21:11] but we're only using 10TB right now, and almost all of that is dumps
[19:22:22] killing the process on labstore1 didn't fix visualeditor :(
[19:23:21] wait
[19:23:26] I take that back
[19:23:32] labstore1 is definitely the issue
[19:31:07] Coren, andrewbogott: know of any other projects that were having this issue?
[19:31:33] None that have piped up yet, anyways. Lemme do a quick check.
[19:33:47] Ryan_Lane, just bastion and visualeditor
[19:34:53] None of {bots,webtools,tools}
[19:35:13] Visualeditor? Then it's clearly James_F's fault.
[19:37:13] :)
[19:38:48] salt to the rescue
[19:38:51] wikistats is broken
[19:39:12] so is ipv6
[19:40:52] Coren: what's the status of bug 46170?
[19:41:39] hashar: hi :) have you seen my email?
[19:42:57] giftpflanze: Hm. I should update that. fcron is completely unsupported, so not really plausible. Pretty much the only solution that isn't going to risk breaking other cron users is the wrapper script idea; would you like me to give you a hand coding it?
[19:43:33] *sigh*
[19:43:50] this is miserable
[19:44:50] To be honest, I'm still having a difficult time figuring out your precise use case.
[19:45:10] I also don't understand why timezones would be necessary...
[19:45:33] it's the dst part that is important to me
[19:45:45] can you describe the usecase?
[19:45:59] it's not that it couldn't be coded by myself, but i wanted to avoid that
[19:46:09] !log deployment-prep -bastion : restarting puppet. Restarting beta autoupdater.
[19:46:12] Logged the message, Master
[19:46:13] ok, let me try it
[19:47:57] awjr: yeah was busy replying to it :-]
[19:48:03] the usecase roughly is: i have some scripts for dewiki that have to run at midnight CET/CEST
[19:48:18] awjr: you get the answers there. TL;DR : yeah it is updating automatically, even mediawiki-config.git since last week.
[19:48:36] awesome, thanks hashar - sorry for harassing you :p
[19:49:00] hashar so are the reasons we're not seeing changes on betalabs right now due to gluster issues
[19:49:00] ?
[19:49:22] awjr: do harass me. Or you will never get anything from me =)
[19:49:26] :p
[19:49:32] giftpflanze: So it runs at 24 intervals, except twice a year when it will do 23 and 25?
[19:49:38] 24h*
[19:49:58] hashar what is the frequency with which updates go out to betalabs?
[19:50:03] awjr: the beta auto updater was disabled for most of the day. I probably forgot to restart it. I did the update of MobileFrontend a few minutes earlier so that should be live
[19:50:10] awjr: rest is in d4 mail :-]
[19:50:16] cool :)
[19:50:22] Coren: so to speak
[19:50:43] awjr: I will probably make Jenkins report beta update failures in this channel
[19:50:56] that would be great
[19:51:08] giftpflanze: why does it need to run exactly at midnight de time?
[19:51:19] err CET/CEST time
[19:51:39] awjr: now I need to troll^Wreply some more emails :-D
[19:52:14] giftpflanze: The "wrapper" would be as small as "[ $(TZ=:Europe/Berlin date +%H) = 0 ] && yourscript", so I think that's manageable.
[19:53:25] Ryan_Lane: because that is the time when the new day in dewiki begins. at that time i add a meta header for the new day in some pages
[19:53:25] scfc_de: sounds good
[19:53:25] giftpflanze: (Or rather "= 00".)
[19:53:30] * Ryan_Lane nods
[19:55:09] giftpflanze: I agree with scfc_de; I'm hesitant to replace a system daemon whose behaviour is relied upon by all users for an edge use case that has a simple workaround; especially if it runs as root and would need a security review.
[19:55:22] mh
[19:55:25] hashar it looks like MF code hasn't been updated on betalabs - Special:Version shows we're on f7228495e54fbfc5db7bec4096bd772227367b78 but it should be e82a07bc5bbff1971d197cb92c4dc34877d0f92d
[19:55:42] i still don't think that it really is such an edge use case
[19:55:52] But rather than Coren trying to press all possible options in one script, I would prefer individual users solving this through the wonders of shell :-).
[19:57:04] giftpflanze: As far as I know, yours is the only tool where timezone-specific scheduling is a requirement. :-) Most other wikis tend to do daily tasks along UTC; but dewp is probably different in that almost all of its users are geographically concentrated.
[19:57:29] true
[19:59:42] hashar actually im seeing it now :)
[20:00:30] awjr: :-]
[20:02:20] awjr: I suck at writing documentation :(
[20:20:45] hashar: is the beta updater a Jenkins job now?
[20:20:55] chrismcmahon: not yet :D
[20:21:05] chrismcmahon: just the mediawiki config and the db update of `enwiki`
[20:21:46] hashar: nice! where would I see these jobs? I was looking around Jenkins last week and didn't notice them
[20:41:42] * ^demon pouts
[20:42:40] <^demon> hashar: Do you still use that gerrit-dev-zap instance for anything?
[20:47:03] ^demon, is your new project working properly now?
[20:47:17] <^demon> Still can't create instances.
[20:47:26] <^demon> (Not an error, just no link to create them)
[20:47:43] <^demon> Ah, I need projectadmin.
[20:47:44] <^demon> https://wikitech.wikimedia.org/w/index.php?title=Special:NovaInstance&action=create&project=solr&region=pmtpa
[20:47:54] <^demon> (says nowai you haxor)
[20:48:32] try now?
[20:48:57] ^demon: no idea what `gerrit-dev-zap` is, does not seem to be something I created.
[20:49:06] <^demon> andrewbogott: Sweet, thanks dude.
[20:49:20] ^demon: Sorry, I thought that part was automatic.
[20:49:55] <^demon> All good now :)
[20:50:08] <^demon> hashar: It looks like it was built in November, then unused since.
[20:50:22] I guess you can delete it so :-]
[20:51:00] Anything whose name ends with '-zap' is necessarily cool.
[20:51:38] Like, "submit-spreadsheet". Uncool. "submit-spreadsheet-zap" -> 120% cooler.
[20:56:27] <^demon> Playing with Zookeeper is making me want to play Zoo Tycoon.
[20:56:30] <^demon> I wanna be a zookeeper!
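Fleshed out, the wrapper Coren and scfc_de sketch above could be dropped into a crontab roughly like this: cron fires every hour in UTC and the wrapper only runs the real job when it is midnight in Europe/Berlin, so DST is handled by the tzdata rules. Paths and the script name are placeholders:

    # crontab entry (cron runs in UTC; fire the wrapper at the top of every hour):
    # 0 * * * * /data/project/mytool/bin/midnight-wrapper

    #!/bin/bash
    # midnight-wrapper: only run the real script when it is 00:xx in Berlin (CET/CEST).
    if [ "$(TZ=:Europe/Berlin date +%H)" = "00" ]; then
        exec /data/project/mytool/bin/daily-update   # placeholder for giftpflanze's script
    fi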
[20:57:14] that's what i keep saying i'd do whenever i get generally annoyed by all things computer-related :)
[20:58:47] <^demon> mutante: Let's quit this and become zookeepers :)
[21:01:00] ^demon: :) http://www.sfzoo.org/jobs-general-information
[21:01:19] 30% discount in the gift shop, wtg
[21:01:26] <^demon> Sweet.
[21:06:03] andrewbogott: what was ^demon's issue?
[21:06:11] <^demon> projectadmin :p
[21:06:44] Yeah, I created the new project with him as initial user but it didn't make him admin. Maybe that's correct behavior; I haven't made a new project in a while.
[21:08:15] ah
[21:08:29] andrewbogott: did you use the checkbox for projectadmin?
[21:08:38] I think there's a bug in that code
[21:08:45] I think I didn't :)
[21:08:49] ah
[21:08:51] heh
[21:08:54] that'll do it :)
[21:11:50] <^demon> Hmm, I can't seem to ssh to my new instance.
[21:11:54] <^demon> (s), even
[21:12:41] ^demon: did the puppet run finish?
[21:12:46] <^demon> Yes.
[21:12:49] <^demon> Ah, user storage.
[21:12:50] <^demon> Unable to create and initialize directory '/home/demon'.
[21:13:03] yeah, that's slower nowadays
[21:13:07] to make gluster more stable
[21:13:11] * Ryan_Lane grumbles
[21:35:18] <^demon> Ryan_Lane: Homedirs still not created. Am I just not being patient enough?
[21:47:40] Coren, I have a note to add a sudo policy for new security users: "chown -R : /data/project/"
[21:47:59] Is that moot now, since homedirs are configurable and they're going to default to /data/project/?
[21:48:06] In tools, at least?
[21:49:09] It's not moot; that's the ugly hacking workaround some people prefer to a properly finely tuned tool to help manage ownership of shared files between a tool and its maintainers. :-)
[21:49:20] :-P
[21:49:59] :) ok
[21:50:11] andrewbogott: The non-snarky answer is "Yes, still needed". If you want to background, ask Ryan. :-)
[21:50:23] s/want to/want the/
[21:52:43] ^demon: it shouldn't take that long
[21:53:43] seems the daemon hasn't made a change in an hour
[21:55:14] <^demon> This is on solr-zk[0-2], solr[0-3] and solr-mw.
[21:55:22] Creating volume tools-project with bricks labstore1.pmtpa.wmnet:/a/tools/project labstore2.pmtpa.wmnet:/a/tools/project labstore3.pmtpa.wmnet:/a/tools/project labstore4.pmtpa.wmnet:/a/tools/project
[21:55:28] is that the project name?
[21:55:41] hm
[21:55:45] hm
[21:55:45] no
[21:55:49] that makes no sense
[21:55:55] why would it even try to create that volume
[21:56:38] ^demon: which project is this?
[21:56:54] <^demon> solr.
[21:57:28] hm
[21:57:37] oh
[21:57:38] I know
[21:58:21] ^demon: do you need project data as well as home directories?
[21:58:34] we made a change recently where new project don't get home or project volumes
[21:58:39] *projects
[21:59:01] this is to avoid creating a shitload of gluster volumes
[21:59:16] <^demon> No, I don't need project storage.
[21:59:25] ok. I configured your project for home dirs
[21:59:35] project admins can do this via "manage projects"
[22:00:09] <^demon> Ah, I see this now.
[22:21:34] Coren: Can you create another tool for me? Name is "ecmabot"
[22:30:31] Coren, the chmod rights are for everyone in the service group or just the special service user (with the same name as the group)?
[22:58:37] Change on mediawiki a page Wikimedia Labs/Tool Labs/Help was modified, changed by Krinkle link https://www.mediawiki.org/w/index.php?diff=663638 edit summary: [+5] /* Access */
[22:59:03] petan: Do you know if there is a page or script to create a "tool" account in tool labs?
[23:00:25] Change on mediawiki a page Wikimedia Labs/Tool Labs/Help was modified, changed by Krinkle link https://www.mediawiki.org/w/index.php?diff=663639 edit summary: [+2] /* Published directories */ space
[23:14:12] petan: andrewbogott: Do you know if anyone other than Coren can create "tools" (e.g. create the user account and set up the directory and stuff)
[23:14:49] is the process documented yet? I'd like to create a bot today and have it run in tool labs with jstart
[23:35:14] Krinkle: we're still working out the interface for adding tools
[23:35:26] Krinkle: it's manual for now, and Coren handles it
[23:35:40] we put out a kind of RFC about it a week or so ago
[23:36:07] we aim for it to be automated, though
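On the jstart mention: once a tool account exists, submitting a bot as a continuous, auto-restarted grid job would look roughly like the line below. This assumes the jsub/jstart wrappers as documented for Tool Labs at the time; the job name and script path are purely illustrative.

    # run from the tool's account; -N names the job, and jstart resubmits it if it exits
    jstart -N ecmabot python /data/project/ecmabot/bot.py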