[00:10:31] Krenair: no clue, never had to do it on a machine before that did not have mwscript
[00:12:58] yes, it should be the same as prod
[00:17:51] bd808: huh, no idea what I wanted to do with getErrorsByType in that patch
[00:18:33] the only thing I wondered about in the changes was dumping the whole response into the log context
[00:18:35] AFAIU it produces the same information as getWikiText
[00:19:11] well, the response is what typically contains the actual error message
[00:19:27] imagine fetching a Commons thumbnail for example
[00:20:04] the status message will be something like "request failed with HTTP 500" and the response body will be the error message that the scaler outputs
[00:20:50] given that logging only happens in HTTP failure states I can't see it go terribly wrong
[00:22:24] cool
[11:50:46] hey has anybody successfully configured mediawiki to use postfix for SMTP? I've confirmed that my SMTP server works but I am hitting an error in MW: "SMTP: STARTTLS failed (code: 220, response: 2.0.0 Ready to start TLS)]"
[16:37:10] I wish they would merge my commits in https://github.com/hhvm/oss-performance
[16:53:48] Reedy, tried poking on #hhvm ?
[16:53:53] Nope
[16:54:22] Last time I asked for hhvm-help, no one responded... I ended up working it out :P
[18:01:36] ori: sometime when you are in the mood to think about better logging and monitoring, check out https://www.mediawiki.org/wiki/User:BDavis_%28WMF%29/Projects/Logging,_metrics_and_monitoring and add/suggest additional things
[18:03:16] bd808: talk page or IRC?
[18:03:37] talk page probably for comments. easier to point others to
[18:04:17] * bd808 has little hope of doing more than starting a discussion before next FY
[18:05:38] actually, can I reply here (or in a PM?). a wiki page gives me writer's block.
[18:06:59] sure
[18:09:05] so, first, I'm happy to see that you're thinking about this, because I agree that we have a major problem, and I don't see things improving with the current allocation
[18:12:49] I don't think the problem-statement gets to the heart of the matter. So we have lots of tools, so what? We have like two or three JavaScript testing frameworks in use but I don't think it's a crisis.
[18:13:43] I think you get closest to the heart of the matter with :"The sheer number of tools also compounds confusion of where to look, how to sign up to get notifications and what to do if something looks wrong. "
[18:14:28] yeah
[18:14:45] hard to know what data is good data; where to look
[18:15:07] I'd go farther and say that our automation and monitoring abstractions have not kept pace with the growth of our infrastructure
[18:15:30] and that the result is that we're often stuck in a reactive mode, putting out fires, and doing lots of repetitive, ad hoc work
[18:16:12] we've discussed all this inside ops, fwiw
[18:16:18] including dedicated resourcing for it
[18:16:26] it was actually part of our original proposal for this FY
[18:16:46] what happened?
[18:16:48] although I disagree that the amount of tools is the problem
[18:17:17] budget wasn't as much as it was promised to be? :)
[18:18:04] those tools aren't competitive with each other
[18:18:11] (and we have more than the ones listed there, btw)
[18:18:25] I don't think the problem-statement gets to the heart of the matter. So we have lots of tools, so what? We have like two or three JavaScript testing frameworks in use but I don't think it's a crisis.
[18:18:26] the problem is that none of them gets the attention they need
[18:18:33] yeah, I saw, agreed
[18:19:06] I don't think "there weren't enough resources" is a satisfactory explanation either
[18:19:27] we've also just gotten complacent, unambitious and demoralized
[18:20:06] I don't think that's the case, no
[18:20:28] I think I'd agree that problem #1 is that we have things (ELK, graphite, ..) that people and teams depend on in various ways but need to be kept running by people who are largely "borrowing time" from their "real work"
[18:20:42] the fact that a proposal was made for this specifically shows that we are not unambitious, I think
[18:21:01] btw, meta-question: what does any of this have to do with #mediawiki-core?
[18:21:17] bd808: I agree with you that that's the case, but I don't think that that's the problem; that's a reason why we're not solving the problem
[18:21:45] paravoid: nothing at all except I started poking ori here because it's our comfy shared space
[18:21:48] ori: to respond to "not enough resources" you'd have to point me to other projects that should be sacrificed for this
[18:22:35] paravoid: I don't think these issues are anyone's fault and I know Ops has been thinking about the issue
[18:23:16] I think we have a bunch of lazy staff in cruise control
[18:23:36] I started that page at a point where I was thinking about asking people if it could be my job. Today I have other ideas but I don't want the idea to disappear entirely
[18:23:48] wow. that's pretty strong ori
[18:24:04] that's still a resourcing problem, isn't it?
[18:24:19] I guess so
[18:25:42] hey mark
[18:25:51] :P
[18:26:45] CloudFlare had a good blog post today about HTTP 2.0 and SPDY, and I clicked around their site afterwards. Their staff is 183 in total: https://www.cloudflare.com/people/. Last month they provisioned their 69th data center. We can't even fantasize about a number like that because our automation across the stack is so poor.
Everything requires massive amounts of manual work.
[18:26:55] in any case, I think we all agree on the premise/general idea, right?
[18:27:09] you're exaggerating, ori
[18:27:51] I'm not blaming ops, btw
[18:28:00] I'm not feeling insulted, I just think you're wrong
[18:28:09] caching pops have bastions, lvses and varnishes
[18:28:12] I am also not suggesting that we fire or yell at people
[18:28:15] all of these are fully puppetize
[18:28:17] d
[18:28:20] but I think we have to start motivating each other to think deeper about problems
[18:28:50] we could do a bit better in installing systems etc. and there's obviously other things that could use better automation
[18:29:24] but "massive amounts of manual work" isn't true
[18:29:50] ori: you know a wikipedian works at CloudFlare, right? User:FiloSottile
[18:30:24] from 179 staff, I count 17 SREs and 28 systems engineers
[18:30:31] https://dpaste.de/hLJZ/raw
[18:30:35] 4 web engineers and 1 web developer
[18:30:45] and 1 Visual Designer
[18:30:45] I'm in those logs a lot, and frequently
[18:31:04] 3 "Product Engineers" and 4 PMs
[18:31:19] but wtf, right? the first time that happens the first order of business should be to get the app server working again and the second order of business should be to make that not require manual intervention ever again
[18:31:25] oh and 3 Network SREs
[18:32:02] and 5 Network Engineers
[18:32:07] and 1 Network Strategy
[18:32:36] so yeah, you can't really compare us to a CDN
[18:33:21] there is certainly truth to the "lazy staff" IMHO, and I'd add "incompetent staff" to it
[18:33:30] maybe comparisons are beside the point.
I think we can do a lot more with what we have
[18:33:46] but I don't think our problems are for a lack of ambition
[18:34:26] case in point: bd808's page is something that is new and ambitious, and we sort of all agree on the general idea
[18:34:30] it has not happened
[18:34:47] not yet :)
[18:34:52] asia is something we also proposed and also did not happen for reasons that are beyond individual engineers
[18:35:01] or mid-level managers
[18:36:24] I think the comparison to CloudFlare was unfortunate, because it came across as a criticism of our efforts in provisioning DCs. Let me try and re-frame it in a way that I think you could agree with.
[18:36:49] ok :)
[18:37:42] One necessary (but not sufficient) condition for improving the status quo is a shift in how we think about problems
[18:38:50] (go on?)
[18:40:14] Manual, ad-hoc work should be considered a failure. Not as severe as an outage, but a failure nonetheless. The problem is not resolved when the condition that triggered the alert is ameliorated, but when steps have been taken to ensure the condition does not happen again.
[18:40:58] can I respond?
[18:41:09] No!
[18:41:13] (yes.)
[18:41:18] so, two things
[18:41:36] first, I don't think we need that "shift" that you say -- it's not like we don't do any automation
[18:41:52] we certainly could do /more/, but automation always needs an initial investment
[18:42:16] I generally agree that we could and should do more of that and I want us to hire new people with that in mind
[18:42:19] second,
[18:42:33] I think this is generally orthogonal to the problem at heart
[18:42:55] I don't think the sad state of our "log, metrics and monitoring" (as bd808 called it) has anything to do with automation
[18:43:06] it takes time to try all those tools, set them up properly and integrate them
[18:44:05] it need time, focus, attention and a coordinated effort
[18:44:37] needs*
[18:44:58] how do we get there?
[18:46:02] I think dedicated resourcing for this is a good idea
[18:46:08] these specific things also need cross-disciplinary participation and thinking. The best thing to maintain may not be the best thing to use, etc. And once things actually work they need some evangelization
[18:46:19] and I do think that the transformation of techops to an SRE team is also a good idea
[18:46:25] but I think these two are orthogonal with each other
[18:46:38] well that's boring, in that case we all agree
[18:47:08] ...which is why we've been discussing that since the reorg ;)
[18:47:28] since *before* the reorg you mean
[18:47:36] since the reorg was being planned
[18:47:40] yeah, that
[18:47:47] just clarifying
[18:47:58] let's stop calling it the reorg and start calling it "the coup by Product"
[18:48:25] I think where I differ (or where I put a slightly different emphasis) is that I think culture is as important as, if not more important than, resourcing
[18:48:34] and I think it's malleable through sheer force of will
[18:48:41] not easy, but possible
[18:48:42] which culture?
[18:49:21] engineering culture, by which i mean the habits of thinking and interacting around engineering problems
[18:50:29] specifically for this monitoring project, what is the kind of culture shift that you'd like?
[18:51:01] intolerance of noise, first of all
[18:51:40] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[18:51:46] nobody cares
[18:52:37] we sort of have to wipe the slate clean, and get worse before we get better, in that i think we should just blanket-silence everything and reintroduce things in this fashion:
[18:52:50] I think what you're missing is
[18:53:02] add an alert, and verify that every time it gets raised somebody acts and there's a follow-up
[18:53:19] so, this follow-up takes time, right?
[18:53:30] and assuming we're not sitting idle, it means that we'd have to drop something else
[18:53:31] i've proposed setting up a dedicated, cross-functional team in various places various times, including to you
[18:53:37] i did NOT receive a lot of enthusiasm :)
[18:53:40] so what should we drop?
[18:54:00] (or we should hire new people, or we should fire some people and hire some new better people)
[18:54:17] (not talking specifics btw, just talking about "resourcing")
[18:54:47] the thing you're talking about needs an investment of time -- it will pay off, we don't disagree
[18:54:53] but it needs an initial investment
[18:55:52] drop: OCG, Flow. downsize: search, analytics.
[18:56:06] i even got the response that we did not have enough availability for (people for) an availability team :)
[18:56:16] so I'm not sure how you're not contradicting yourself now
[18:56:49] ori: do you think that "drop: OCG, Flow. downsize: search, analytics" because of "we want to automate more" is going to be a choice upper mgmt will make?
[18:57:23] i think upper mgmt is more responsive than we'd sometimes like to admit
[18:57:29] or to put it another way
[18:57:35] when you were talking about a shift in attitude
[18:57:50] how broadly did you envision this shift of attitude happening?
[18:57:58] bd808: Oh, good. I'm not the only one who thought that (re the "reorg").
[18:58:21] I thought it was a criticism of bryan, myself, ops, etc., not of half the org
[18:58:38] so maybe that's where our difference is, fundamentally
[18:58:54] when I said "resourcing" I didn't mean "WMF doesn't hire enough people", I meant "ops doesn't have enough people"
[18:59:10] in my head it's a criticism for me first and foremost
[18:59:14] (hence my breakdown of cloudflare's staff)
[18:59:35] i think i need to change my habits and model what i think is the right approach
[18:59:48] and make the case that way rather than by complaining
[19:00:12] no, complaining is useful too
[19:00:29] I'd like to hear honest unfiltered complaints against me or my team for sure
[19:00:55] moving to 5, brb
[19:01:11] but at the same time I tend to be more interested in the complaints that i can fix, rather than the ones that take the agreement of 8 mid/upper-level managers to fix :P
[19:01:47] not because I'm not willing to fight for it, but because these fights don't seem to be very successful so far
[19:03:53] i think we can inspire a shift in attitude. i think ops can improve the acoustics around monitoring even given the current allocations.
[19:04:22] how?
[19:05:20] if i were you i'd adopt a three strikes policy with respect to alerts
[19:05:32] the third time an alert is not handled it gets removed from puppet
[19:06:12] =happy outages without alerts
[19:06:19] they won't be happy
[19:06:28] it'll almost certainly result in some useful alerts getting dropped and some failures not getting caught
[19:06:50] but it'll force people to re-evaluate how much thought needs to go into monitoring
[19:07:14] your problem wasn't that alerts aren't being handled, it was that they got handled too often, no?
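The "three strikes" idea proposed above (the third time an alert fires and nobody handles it, the alert is removed) can be sketched as a small bookkeeping class. This is only an illustration: the class name, the `handled` flag, and the choice to count strikes cumulatively rather than consecutively are all assumptions, and nothing here touches puppet or any real alerting system.

```python
from collections import defaultdict


class ThreeStrikesPolicy:
    """Count unhandled firings per alert and flag alerts for removal.

    Hypothetical sketch: strikes accumulate and are never reset, which is
    one possible reading of "the third time an alert is not handled".
    """

    def __init__(self, max_strikes=3):
        self.max_strikes = max_strikes
        self.strikes = defaultdict(int)  # alert name -> unhandled firings
        self.removed = set()             # alerts that used up their strikes

    def record_firing(self, alert_name, handled):
        """Record one firing; return True if the alert should be removed."""
        if alert_name in self.removed:
            return True
        if handled:
            return False
        self.strikes[alert_name] += 1
        if self.strikes[alert_name] >= self.max_strikes:
            self.removed.add(alert_name)
            return True
        return False


policy = ThreeStrikesPolicy()
for handled in (False, True, False, False):
    drop = policy.record_firing("kafka_replica_max_lag", handled)
print(drop, sorted(policy.removed))
```

Whether a handled firing should reset the count is a policy decision the discussion does not settle; resetting would make the rule more forgiving of alerts that are only occasionally ignored.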
[19:07:23] your problem statement, rather
[19:07:28] it is a problem, I don't disagree :)
[19:07:40] no, there are two separate problems
[19:07:52] is the idea that there is no such thing as an informational irc msg and any irc message needs to be expected to result in someone investigating?
[19:08:55] yes. our infra is too broad. the fact that something is interesting is not enough to offset the distraction it introduces
[19:09:21] there are interesting things happening everywhere all the time
[19:09:34] In the case of the kafka notice above re: broadness would it make more sense to drop that in analytics and not ops
[19:10:07] not really, no
[19:10:07] but analytics is on #wikimedia-operations too, it's not like they didn't see it
[19:10:25] they just didn't act. so i'm not sure it's a matter of it going to the right channel
[19:10:45] the point being we don't need to
[19:10:56] don't need to what?
[19:11:03] I was responding to the distraction component
[19:11:21] and saying we don't need to what?
[19:11:37] don't need to see the message in -ops to have it be a message and then are not distracted
[19:11:47] assuming analytics finds it useful
[19:12:04] I don't really agree with the premise per se for irc but trying to understand the problem from your perspective
[19:12:16] it's information-hoarding
[19:12:28] well, wouldn't they be the best judge of that?
[19:12:33] considering they have their own ops and all
[19:12:55] no, they could be wrong
[19:13:17] well if analytics is wrong about their own alerting and the value then we have that as an issue first I guess
[19:13:17] no, kafka is not strictly an analytics concern
[19:13:25] it's not "their" alerting
[19:13:30] ^ that
[19:13:36] (not their alerting)
[19:13:41] and not strictly an analytics concern
[19:14:25] is there any subsystem that is strictly for a non-core-ops team or is everything globally important to where we cannot have independent informational messaging
[19:15:16] I thought the proposal a moment ago was no one cares and we should nix it after 3 strikes so moving it to analytics and letting them determine the value beyond ours
[19:15:16] seems less severe
[19:15:24] it can't share the same bus as urgent emergencies
[19:15:45] it's great that aunt sally calls to let you know how her cat is doing, but she shouldn't use the red telephone for that
[19:16:02] I didn't think irc was the medium for urgent emergencies at all
[19:16:04] I mean, paging should be that mechanism?
[19:16:19] irc is by its nature more informational than functional for emergencies
[19:16:41] i don't agree
[19:16:52] also
[19:16:53] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0]
[19:17:05] what's the signal to noise ratio of that message?
[19:17:14] looking at it purely from an "informational" angle
[19:17:18] how many examples of that do you have though?
[19:17:21] wtf does " Less than 1.00% above the threshold [1000000.0]" mean?
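One plausible reading of messages like "11.11% of data above the critical threshold [1000.0]" is that the check looks at the datapoints in some time window and reports what share of them exceeded a fixed threshold. The sketch below assumes exactly that and nothing more: it is not taken from the real check_graphite plugin, and the function names, the `max_percent` cutoff, and the wording of the output are hypothetical. It mainly shows how the same numbers could be phrased less opaquely.

```python
def percent_above(datapoints, threshold):
    """Return the percentage of non-null datapoints strictly above threshold."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return None  # no data at all; a check would probably report UNKNOWN
    above = sum(1 for v in values if v > threshold)
    return 100.0 * above / len(values)


def describe(name, datapoints, threshold, max_percent=1.0):
    """Build a status line that says what was measured and against what."""
    pct = percent_above(datapoints, threshold)
    if pct is None:
        return f"UNKNOWN - {name}: no datapoints in the window"
    state = "CRITICAL" if pct >= max_percent else "OK"
    return (f"{state} - {name}: {pct:.2f}% of datapoints in the window "
            f"exceeded {threshold:g}")


# Nine samples, one of them over the limit: 1 / 9 = 11.11%.
samples = [10, 20, 30, 40, 50, 60, 70, 80, 1500]
print(describe("Text HTTP 5xx reqs/min", samples, 1000.0))
```

Under this reading, "Less than 1.00% above the threshold [1000000.0]" would mean that fewer than 1% of the sampled lag values exceeded 1,000,000.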
[19:17:32] yes, check_graphite checks are in a very sad state in general
[19:17:38] they're at UNKNOWN all the time too
[19:18:06] most of those style notifications are opaque to me as well, I have talked with brandon about deciphering them a few times
[19:18:25] neither of us had any good insight into what the alerts actually //meant//
[19:18:46] honestly, I'm losing track of this discussion
[19:19:24] * ori shrugs.
[19:19:25] yes, confusing alerts is a problem and we should fix them -- what does that have to do with our monitoring infrastructure though?
[19:19:45] we're discussing too many different issues at once
[19:20:05] multiple good points have been raised, but what is this discussion about? :)
[19:20:14] i'll try to tie it together
[19:21:10] if the production databases got corrupted, nobody would think to say "well, we don't have a data integrity team, so i guess we're not fixing it". we'd drop everything and fix it, because we understand that responding to failure trumps other commitments
[19:21:38] we need to adopt a broader notion of what failure means
[19:22:11] the short term and maybe mid term impact will be that our progress on various projects and quarterly commitments etc would be decimated
[19:22:57] but unless we take the hit and acknowledge to ourselves that the state of our monitoring infrastructure is currently in near "outage" state (i.e., requiring immediate out-of-band attention) we'll never fix it
[19:25:42] i don't think I can acknowledge things I don't agree with :)
[19:27:06] what we have been doing is securing resources to work on it, as evident from our roadmap in august, and our goals recently
[19:27:22] not nearly as much as we'd like
[20:39:45] bd808: I think I've some idea why the msg_resource row count is so high
[20:41:26] oh?
[20:44:00] mr_lang and count
[20:44:00] | À®À®À¯À®À®À¯À®À® | 222 |
[20:44:00] | À®À®ÀœÀ®À®ÀœÀ®À® | 15 |
[20:44:04] | thishouldnotexistandhopefullyitw | 57 |
[20:44:14] | response.write(9968966*9731283) | 1 |
[20:44:14] | response.write(9970403*9303969) | 15 |
[20:44:14] | response.write(9995674*9319182) | 2 |
[20:44:54] 1671 rows in set (3.02 sec)
[20:45:44] We have 370 languages supported in MW
[20:46:32] I think we need a DELETE FROM msg_resource NOT IN $supported_languages
[20:48:03] huh. so we accumulate new rows when people put random things into uselang=foo?
[20:48:13] I'm not sure
[20:48:16] But it seems like it
[20:50:37] Reedy: there's a bug for that
[20:50:46] that Krinkle filed
[20:50:52] (don't have a link handy)
[20:51:14] heh, I thought I saw the "thisshouldnotexist" from him before
[20:51:20] Reedy: people actually use this for legitimate stuff. guess which wiki is crazy enough for that
[20:51:34] commons!
[20:51:41] bingo
[20:54:43] um
[20:54:45] This is scary
[20:55:07] | en); if(95=95) select 1 else dro | 6 |
[20:55:07] | en); if(99=63) select 1 else dro | 2 |
[20:55:07] These look like sql injection attempts
[20:55:56] DELETE FROM msg_resource WHERE mr_lang NOT IN array_keys( Language::getLanguageNames() )
[20:56:21] ?uselang=little_bobby_tables
[20:57:13] Krinkle: Do you know which task it is for clearing crap out of msg_resource?
[20:57:31] Reedy: https://phabricator.wikimedia.org/T113092
[20:57:46] Reedy: That was already done though
[20:57:53] What was already done?
[20:57:57] It now enforces isBuiltInLanguageCode
[20:58:07] but may not be deployed yet and/or stale data will remain stale
[20:58:10] because the table has no ttl
[20:58:14] Ah, so we just need to clear out the crud?
[20:58:19] I'll hack up a maintenance script to clean it
[20:58:20] Yeah
[20:58:29] keeping only getLanguageNames should be fine
[20:58:29] Should it go in MW or WikimediaMaintenance?
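The cleanup discussed above (keep only rows whose mr_lang is a known language code, delete the rest) can be sketched outside MediaWiki with an in-memory SQLite table. Everything in this sketch is an assumption made for illustration: the two-column table is a simplified stand-in for the real msg_resource schema, `supported` stands in for the keys of Language::getLanguageNames(), and the batching and the `wait` hook are design choices of the sketch, not a description of the actual maintenance script. Bad codes are selected in Python and passed as bound parameters, so injection-style values such as the ones in the table never become part of the SQL text.

```python
import sqlite3


def purge_unsupported_langs(conn, supported, batch_size=100, wait=lambda: None):
    """Delete msg_resource rows whose mr_lang is not in `supported`.

    Language codes are removed in batches; `wait` is a hypothetical hook
    called after each batch (for example, to let replicas catch up).
    Returns the number of rows deleted.
    """
    rows = conn.execute("SELECT DISTINCT mr_lang FROM msg_resource").fetchall()
    bad = [lang for (lang,) in rows if lang not in supported]
    deleted = 0
    for start in range(0, len(bad), batch_size):
        batch = bad[start:start + batch_size]
        placeholders = ",".join("?" for _ in batch)
        cur = conn.execute(
            f"DELETE FROM msg_resource WHERE mr_lang IN ({placeholders})",
            batch,
        )
        conn.commit()
        deleted += cur.rowcount
        wait()
    return deleted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE msg_resource (mr_resource TEXT, mr_lang TEXT)")
conn.executemany(
    "INSERT INTO msg_resource VALUES (?, ?)",
    [
        ("startup", "en"),
        ("startup", "de"),
        ("startup", "little_bobby_tables"),
        ("site", "en); if(95=95) select 1 else dro"),
    ],
)
print(purge_unsupported_langs(conn, {"en", "de"}))
```

Because the table is a cache, deleting a row for a value that later turns out to be legitimate only costs a regeneration, which is why an allow-list delete is a reasonable shape for this kind of cleanup.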
[20:58:36] that means custom uselang values will be repopulated shortly after
[20:58:50] Reedy: eval?
[20:58:50] :P
[20:59:00] WikimediaMaintenance
[20:59:01] surely it needs foreachwiki? ;)
[20:59:02] or rather
[20:59:21] foreachwiki eval.php --script='MessageBlobStore::clear()'
[20:59:32] Don't do that!
[20:59:35] I was gonna say
[20:59:48] It will be possible in the new model
[20:59:50] Won't that kill EVERYTHING? :P
[20:59:58] but the current model would break db contention
[20:59:59] that's fine
[21:00:01] it's a cache
[21:00:09] We're moving it to memcached
[21:00:19] Patch is 99% done.
[21:00:28] But won't go out until after fundraising
[21:25:15] lol
[21:25:23] MatmaRex: Commons seems to have well over 250k of these
[21:25:36] of whese
[21:26:36] 310k
[21:27:54] bd808: Shall we take bets on how much quicker refreshMessageBlobs is? :D
[21:28:07] Did you fix the timing display issue on l10nupdate-1?
[21:28:55] I put up a patch for it
[21:29:11] https://gerrit.wikimedia.org/r/#/c/256754/
[21:29:57] 380k on commons
[21:30:09] wow
[21:30:19] with a waitforslavelab after each?
[21:30:24] *lag
[21:30:45] deleting 100 at a time
[21:30:48] with waitforslave
[21:31:08] 400k
[21:34:32] 21:28:24 Could not load some extensions because they are missing
[21:34:32] 21:28:24 the expected entry point:
[21:34:32] 21:28:24
[21:34:32] 21:28:24 WikimediaMaintenance
[21:34:51] Where do I kill that?
[21:34:52] 456814 on commonswiki :~
[21:37:33] Considering it's actually not an entrypoint
[21:37:35] I'll readd it
[21:38:15] hmmm...
dunno
[21:38:26] legoktm might know the spot
[21:38:44] o.O
[21:39:10] Reedy: you can create a stub extension.json file with just "name" and "manifest_version": 1
[21:39:17] I guess it's checking for $extensionname/$extensionname.php
[21:39:38] ah, that might work
[21:39:39] yeah
[21:41:40] thanks
[21:54:00] | 154363 |
[21:54:04] Still a lot of rows on enwiki
[21:56:03] bd808: wfWaitForSlave calls reduced in https://gerrit.wikimedia.org/r/256780 :P
[21:59:55] That's a 25% reduction though
[22:13:07] Anyone want to do my 3 easy wfBaseConvert() commits? https://gerrit.wikimedia.org/r/#/q/owner:Reedy+status:open,n,z
[22:22:52] * Reedy hugs legoktm
[22:23:01] :)
[22:56:12] Reedy: some easy reviews -- https://gerrit.wikimedia.org/r/#/projects/wikimedia/slimapp,dashboards/default
[22:56:36] will have a look in a couple of mins :)
[23:19:03] bd808:
[23:19:15] i know right?
[23:19:33] also every other language if you use it long enough
[23:19:44] Haha, true
[23:19:48] * bd808 only hates the ones he loves
[23:21:15] Why is https://gerrit.wikimedia.org/r/#/c/256846/ a backport?
[23:21:23] I can't see the code in https://github.com/wikimedia/wikimedia-wikimania-scholarships
[23:21:33] Or was it already removed?
[23:21:47] it was from before I ripped the local code out and replaced it with the lib
[23:21:56] ah
[23:21:56] let me find the original for you
[23:22:07] thanks
[23:23:02] https://github.com/wikimedia/wikimedia-wikimania-scholarships/commit/9dea429f65e007f3e043e1068d536dc291dcf34d
[23:23:58] that library was extracted out of iegreview which was a cut-n-paste fork of scholarships but then they diverged
[23:24:13] I didn't catch all of the new features in scholarships apparently
[23:24:39] soon they will both use the lib and this will be easier to keep in sync
[23:24:56] Woo, programming
[23:29:34] "Slowvote is based on a Facebook application called Quickvote, which was pretty good when it was originally written but went through a couple of poorly executed rewrites and ended up being extremely slow and glitchy, often taking more than 10 seconds to load."
[23:29:40] "I originally designed and implemented the entire Slowvote application while waiting for a Quikvote to load."
[23:31:24] TimStarling: Do you work for Facebook now?
[23:31:40] no, I'm quoting someone else who wrote that
[23:32:03] I was joking. It sounds like something you'd say/do :)
[23:32:03] just reading about slowvote after chad suggested it for T118932
[23:32:16] yup
[23:37:57] anyway, I apparently can't access slowvote on wikimedia's phabricator and can't configure it either
[23:38:08] and https://phabricator.wikimedia.org/config/ is a 403
[23:41:45] TimStarling: i made you an admin, try again (cc greg-g)
[23:42:12] thanks
[23:43:12] Think we can get a PHPSUCKS tag on Phabricator?
[23:43:59] ori: ah, gotcha
[23:44:14] Reedy: is already there ;)
[23:47:35] Reedy: For our local copy of HHVM's git repo?
[23:51:54] we found a bug that looks to be php runtime strangeness with call_user_func_array() and reference values
[23:51:59] annoying
[23:58:10] can non-administrators see this? https://phabricator.wikimedia.org/V3
[23:58:19] TimStarling: No.
[23:58:28] TimStarling: "This object has a custom policy controlling who can take this action."
[23:58:47] does anyone know why the application is restricted? is it just because everyone hates voting?