[00:00:23] We can process about 1k edits per second
[00:01:46] So we can run this on English Wikipedia in less than 24 hours no problem.
[07:41:41] halfak: I was asleep
[07:41:44] \o/
[13:49:59] o/ Amir1
[13:50:11] I got out pesky deleted rev bug in Wikilabels too
[13:53:30] o/ halfak
[13:53:38] I'm in the birthday party
[13:53:50] I was presenting and your ping came out :))))
[13:54:05] awesome
[13:54:38] I thought it happens because the default window is not set correctly
[13:55:06] Woops! Sorry to interrupt :)
[13:57:16] hah
[13:57:20] it's okay
[13:57:23] I need to go
[13:57:32] but I'll be online when I come home
[13:57:33] :)
[13:57:36] o/
[13:57:50] I've got a few meetings today, but I'll be around on and off.
[15:03:54] I found a work-around for our doc problems.
[15:03:55] !!!
[15:04:04] It turns out that the bug does not exist in sphinx 1.2!
[15:09:42] WORKAROUND :))))
[16:17:08] halfak: https://pbs.twimg.com/media/CYrhJqmWcAA0olo.jpg
[16:17:16] there are lots of pictures to come :)
[16:17:29] anyway
[16:18:14] I think the dump extract might have a problem re default window
[16:43:01] halfak: what else we should do re. the dump extractor?
[16:44:11] Amir1, run it for all our wikis. urwiki first
[16:44:38] yes
[16:44:41] awesome :)
[16:54:22] halfak: re ur wiki. What time span?
[16:55:26] 1 full year ending at the dump date
[17:02:29] kk
[19:17:05] halfak: hey
[19:17:07] ores page?
[19:17:11] Yup
[19:17:16] Looks like our queue is full.
[19:17:21] I'm going to check on our workers.
[19:17:33] ok
[19:18:18] My port forwarding is broken :(
[19:18:20] Can't see flower.
[19:18:22] Weird.
[19:18:47] * YuviPanda should setup flower publicly available with auth at some point
[19:18:52] Yes please :)
[19:19:11] hmm interesting, puppet hasn't run on ores-worker-01 for a while
[19:19:20] OK.
looks like no jobs are being processed on worker-01
[19:19:23] hmm
[19:19:25] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: DNS lookup failed for ores-redis-02.ores.eqiad.wmflabs Resolv::DNS::Resource::IN::A at /etc/puppet/modules/ores/manifests/redisproxy.pp:6 on node ores-worker-01.ores.eqiad.wmflabs
[19:19:27] Warning: Not using cache on failed catalog
[19:19:29] Error: Could not retrieve catalog; skipping run
[19:20:08] Shall I restart the workers?
[19:20:11] Or wait?
[19:20:58] halfak: restart
[19:21:16] ok fixed puppet there
[19:21:20] Woah. I don't have sudo on ores-worker-01?
[19:21:32] try now
[19:21:37] bah
[19:21:40] ldap is broken somehow
[19:21:43] Tried logging out and back in. Now can't log in
[19:21:49] jesus
[19:21:56] ldap is broken?!
[19:22:03] I can log into -02
[19:22:06] ok
[19:22:32] Restarting -02
[19:22:39] Hanging....
[19:22:39] I'm restarting -01
[19:22:43] since I can restart as root
[19:23:08] Still hanging...
[19:23:25] yup
[19:24:09] Redis server seems to be online
[19:24:15] halfak: I kill -9'd the old server process
[19:24:17] and did a start
[19:24:19] then a restart
[19:24:22] oh! my restart finished on -02
[19:24:24] ok
[19:24:31] Not serving though
[19:24:31] :/
[19:26:13] hmm
[19:26:15] celery is up
[19:26:16] at least on -01
[19:26:27] and I see traffic
[19:26:45] Was up for a moment on -02, but not anymore
[19:27:02] Oh wait. maybe we are
[19:27:07] Ahah!
[19:27:12] There it goes!
[19:27:19] works fine?
[19:27:24] Working on -03
[19:27:38] -02 seems to be up. It seems to be processing requests
[19:28:05] And ores.wmflabs.org is *up*
[19:28:22] woo
[19:28:33] icinga should recheck soon
[19:29:26] halfak: do you want to try postmortem now or later?
[19:29:35] Still bringing things back.
[19:29:38] working on -04
[19:29:38] ok
[19:29:49] I think we should figure out what happened before the postmortem, right?
[19:30:25] right, postmortem is figuring out what happened
[19:30:31] and then we write it down for the postmortem report
[19:30:35] step 1 is always 'bring the thing back up'
[19:30:47] postmortem might be a bad name
[19:31:03] -04 is coming back online
[19:31:18] Do you mean, investigate the issue now or later?
[19:31:29] -04 is up
[19:32:11] Sync is back online
[19:32:38] YuviPanda, by "try postmortem" do you mean investigate the issue?
[19:32:58] I think that can wait until the end of the metrics meeting
[19:32:58] probably
[19:33:00] words
[19:33:02] ok
[19:33:08] 'find out what happened'
[19:33:08] I'll keep monitoring
[19:33:16] Yeah. Sounds good.
[19:33:48] * halfak wonders if we should have "hard kick" script that we can run as a first attempt at dealing with downtime.
[19:34:30] halfak: that's already there in fab
[19:34:39] Oh?
[19:34:41] fab restart_celery restart_uwsgi
[19:34:42] * halfak looks at fabfile
[19:34:45] we can parallelize it too
[19:34:47] with some param
[19:36:08] halfak: sudo should work now
[19:37:07] confirmed
[19:38:16] ok
[20:41:14] OK. Time to debug.
[20:41:44] YuviPanda, are you lunching or do you have time to help me figure out what went wrong now?
[20:41:53] I've broken production puppetmaster in the meantime
[20:41:55] so working on fixing that first
[20:41:57] Woops
[20:42:02] No worries.
[20:42:13] I'm going to read up on setting a redis timeout.
[20:42:22] I'm guessing that the workers couldn't figure out how to reconnect to redis.
[21:27:26] OK. So I have confirmed that we are configuring the redis socket timeout correctly.
[21:27:46] If the underlying issue was a lost connection to redis, the workers should have recovered after 15 seconds.
[22:14:17] hello
[22:28:53] o/ pipivoj
[22:28:56] Welcome!
[22:29:51] Oh. You have a nicer netiquette here. :)
[22:31:15] I've learned about it in Cobi's channel. I like it.
[22:31:55] :D
[22:31:58] Who is Cobi?
[22:32:14] Oh! ClueBot author
[22:32:27] Yes. The one.
[22:34:48] Sorry for delay. Had to help out with some chores IRL.
[22:36:03] I'm interested in machine learning and would like to offer my services with ORES or sth on par with it.
[22:36:20] Great! No problem. I'm also going to be in and out.
[22:36:43] What's your experience with ML? Also what are your feelings about programming in Python?
[22:36:45] Have some educational background with it, but nothing practical.
[22:37:27] I like Python as a language but I'm just beginner level.
[22:37:45] Tho I like to think of myself as a fast learner.
[22:38:09] Have more Java and C# experience.
[22:38:16] woops. brb! :S
[23:14:18] halfak, you there?
[23:14:39] Sorry. Meetings back-to-back. Will be done in 45 mins or less (hopefully)
[23:15:01] I have to go.
[23:15:12] When would it be better to talk?
[23:16:15] I'll try again tomorrow. Good night.
[23:31:30] Bah! Sorry! :(
[23:32:22] o/ YuviPanda
[23:32:39] Haven't you solved prod fires
[23:32:42] ?
[23:32:47] Wanna look at ORES fire?
[23:33:18] hello
[23:33:25] I have, still on merge duty for a little bit
[23:33:27] sooo
[23:33:36] No worries. Just checking in. :)
[23:33:41] I guess we'll look at celery logs?
[23:34:39] they didn't have much when I last looked
[23:34:43] Yeah. I think so. I didn't see anything after the startup MOTD
[23:37:25] hmm
[23:46:17] halfak: I've to go now :(
[23:49:42] OK. No worries. So long as we stay online for a while, we can pick this up later :)
[23:50:16] I'll have some copy-pasta from the logs for you tomorrow :)
[23:53:01] ok!
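
Note on the "hard kick" idea raised at 19:33: the log only confirms that `fab restart_celery restart_uwsgi` exists and that it can be parallelized "with some param". The sketch below is one possible way to wrap that, not the project's actual script. It assumes Fabric 1.x, whose `fab` command accepts `-P` (run in parallel) and `-H` (comma-separated host list); the function name `hard_kick_command` and the explicit host list are hypothetical, since the real fabfile may already define its hosts.

```python
import shlex

# Task names are taken from the log; everything else here is an assumption.
TASKS = ["restart_celery", "restart_uwsgi"]


def hard_kick_command(hosts, parallel=True):
    """Build the argv for a single `fab` run that restarts celery and uwsgi.

    `hosts` is a list of hostnames (hypothetical: the real fabfile may set
    its own host list, in which case `-H` would not be needed).
    """
    if not hosts:
        raise ValueError("need at least one host")
    argv = ["fab"]
    if parallel:
        # Fabric 1.x flag: run each task against all hosts in parallel.
        argv.append("-P")
    argv += ["-H", ",".join(hosts)]
    argv += TASKS
    return argv


if __name__ == "__main__":
    # Only the -01 hostname appears in full in the log.
    hosts = ["ores-worker-01.ores.eqiad.wmflabs"]
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(hard_kick_command(hosts)))
```

Building the command as a list and only printing it keeps the sketch harmless; passing the list to `subprocess.run` would be the step that actually restarts anything.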