[07:40:30] Analytics-Kanban: Back-fill pageviews data for dumps.wikimedia.org to May 2015 - https://phabricator.wikimedia.org/T126464#2133298 (elukey) Open>Resolved [07:41:53] Analytics, Datasets-General-or-Unknown, Operations, Traffic: http://dumps.wikimedia.org should redirect to https:// - https://phabricator.wikimedia.org/T128587#2133313 (elukey) @ArielGlenn Hello! Really curious about the document that Daniel pointed out above.. Is it impossible to serve dumps only... [08:20:06] Analytics, Datasets-General-or-Unknown, Operations, Traffic: http://dumps.wikimedia.org should redirect to https:// - https://phabricator.wikimedia.org/T128587#2079753 (Peachey88) Appears it was previously on a separate certificate https://wikitech.wikimedia.org/w/index.php?title=Httpsless_domains... [10:24:36] a-team I'm gone for lunch, back in while [12:35:05] (CR) Mforns: [C: 2 V: 2] "LGTM!" [analytics/dashiki] - https://gerrit.wikimedia.org/r/278025 (https://phabricator.wikimedia.org/T130069) (owner: Nuria) [12:38:45] (PS3) Mforns: Add labels to the hierarchy sunburst [analytics/dashiki] - https://gerrit.wikimedia.org/r/277733 (https://phabricator.wikimedia.org/T124296) [12:39:00] (PS2) Mforns: Highlight greatest leaf of hierarchy by default [analytics/dashiki] - https://gerrit.wikimedia.org/r/277746 (https://phabricator.wikimedia.org/T124296) [13:25:27] mforns: helloooo [13:25:40] I had to work with ops this morning, now lunch and then if you have time EL restart? [13:38:12] heyall [13:38:46] milimetric, hi! [13:38:59] hi mforns [13:46:14] mforns: back1 [13:46:20] do you have time? [13:46:21] hi elukey [13:46:29] elukey, yes! [13:46:35] oooookkkkkkk thanksssss [13:46:40] batcave? [13:47:49] sure! 
[13:48:44] Analytics-Kanban, Editing-Analysis, Patch-For-Review: Re-enable the edit analysis dashboard - https://phabricator.wikimedia.org/T126058#2003512 (mforns) Looking at the report files in stat1003, it seems that reportupdater is catching up: edit-analysis.wmflabs.org [13:48:51] elukey, just a min [13:53:38] goood morning [13:53:43] joal: , yt? [13:55:33] Hey ottomata [13:55:41] There I am [13:56:01] ottomata: wasup? [13:56:21] heeeee [13:56:32] https://phabricator.wikimedia.org/T127351#2131176 [13:56:43] i rememeber we talked about which topics to add partitoins to [13:56:48] do we want to just to text and upload? [13:57:02] the others don't need more partitions, but i'm not sure if we want to keep it consistent or if it matters [14:00:25] ottomata: I think text and upload are the only ones yes [14:00:41] Wealso need to change camus map numbers (would be goo) [14:00:56] yeah [14:01:15] it'd be nice if we could have some metric to see how this helps [14:01:20] maybe camus run time? [14:01:54] ottomata: if you have time mobrovac needs you in security :) [14:02:07] !log restarted eventlogging hosts [14:02:16] k [14:02:22] hm, camus run time is defined as 9 mins by property, that wouldn't help ottomata [14:02:28] oh ja [14:02:28] but [14:02:29] hm [14:02:34] it usually runs faster than that, right? 
[14:02:42] ottomata: file size/number of files is another good one [14:02:48] well, we know that will change [14:03:18] mforns: EL eqiad is up and running [14:03:33] elukey, I saw that, it seems to be vvvorking :] [14:04:49] \o/ [14:04:51] goooood [14:05:11] moritzm: event logging hosts rebooted [14:05:21] ok, thanks [14:17:02] PROBLEM - Difference between raw and validated EventLogging overall message rates on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [30.0] [14:17:17] ouch [14:17:44] going back to normal https://grafana.wikimedia.org/dashboard/db/eventlogging [14:19:22] elukey, yes, that metric has a bit of inertia, but it's fine now [14:28:11] RECOVERY - Difference between raw and validated EventLogging overall message rates on graphite1001 is OK: OK: Less than 20.00% above the threshold [20.0] [14:31:04] ottomata: one way to measure camus improvement would be to stop it for an hour and measure how long it takes to backfill it later [14:31:14] ottomata: then apply the change and measure again [14:31:25] Cause I guess that's the typica example [14:32:49] hm [14:44:10] ottomata: question of day from the ops-newbie: do we have an hadoop replica in deployment-prep AND labs or only labs? [14:44:21] haha [14:44:23] uhhhh [14:44:40] i do not understand your question [14:44:40] https://wikitech.wikimedia.org/wiki/Labs_labs_labs [14:45:06] comeeeeonnnnnn [14:45:31] hhahah [14:45:41] deployment-prep == beta cluster in wikimedia labs [14:45:52] that is where mediawiki runs [14:45:58] in labs, yes [14:46:08] there is also a hadoop cluster in the analytics project in wikimedia labs [14:46:17] that is more for just testing out puppet changes and upgrades, etc. [14:46:27] a101, etc. 
[14:46:41] ok now it makes more sense [14:46:44] but, that cluster is more temporary than the one in beta clsuter [14:46:56] I know that you have probably already told me this, but I forgot :P [14:48:40] haha, it is confusing [15:01:50] Analytics-Kanban: Integrate new browser visualization into wikistats - https://phabricator.wikimedia.org/T129101#2134077 (mforns) A couple questions: - I found http://stats.wikimedia.org/wikimedia/squids/SquidReportOperatingSystems.htm. It's like the SquidReportClients but for OS analysis. I think we cover... [15:12:34] a-team: I just pushed my dumps clean-up patch for review, I probably won't make standup, more problems with the new place. I'm going to work on reviewing joal's new AQS endpoint and work with Jon on dashiki a bit [15:19:39] Analytics, Research-and-Data, Research-management: Draft announcement for wikistats transition plan - https://phabricator.wikimedia.org/T128870#2134093 (DarTar) [15:19:56] hmm, joal a recent camus run only took 5 minutes [15:20:10] i guess let's just use that as a rough bench! as long as it doesn't get worse :p [15:22:28] ottomata: works for mre :) [15:23:15] joal: https://gerrit.wikimedia.org/r/#/c/278288/ [15:23:26] will increase partitions after standup and merge [15:24:26] (CR) Joal: [C: 1] "I dropped previously created table on hive, in case." [analytics/refinery] - https://gerrit.wikimedia.org/r/278193 (https://phabricator.wikimedia.org/T108618) (owner: BryanDavis) [15:25:04] (CR) Ottomata: [C: 2 V: 2] Fix ApiAction record name [analytics/refinery] - https://gerrit.wikimedia.org/r/278193 (https://phabricator.wikimedia.org/T108618) (owner: BryanDavis) [15:26:22] I feel like a dumb ass for getting those names messed up :/ [15:31:56] a-team: Standup? [15:33:01] madhuvishy: standup? 
[15:33:08] madhuvishy: standuppp [16:01:55] Analytics: Send burrow lag statistics to statsd/graphite {hawk} - https://phabricator.wikimedia.org/T120852#1862859 (Ottomata) For reference: https://github.com/linkedin/Burrow/issues/4 [16:11:19] madhuvishy: i'm going to merge this, ja? https://gerrit.wikimedia.org/r/#/c/276615/3/manifests/role/eventlogging.pp [16:11:29] think i just forgot to do it earlier this week [16:11:31] its been long enough [16:11:35] madhuvishy: let me know if your received invite for cubernetics meeting [16:18:34] Analytics, Analytics-Cluster, Operations, hardware-requests: Hadoop Node expansion for end of FY - https://phabricator.wikimedia.org/T124951#2134264 (Ottomata) Open>declined The other analytics hardware requests are currently in pending approval will take up the most of the analytics remaind... [16:21:29] Analytics, Analytics-Cluster: decide how to monitor the cluster {hawk} - https://phabricator.wikimedia.org/T91991#2134269 (Ottomata) Open>declined Not quite sure what this is about, declining. [16:21:31] Analytics, Analytics-Cluster: Epic: qchris transition - https://phabricator.wikimedia.org/T86135#2134271 (Ottomata) [16:21:42] Oh a-team, forgot to mention at standup yesterday and today: halfak has read and commented the sanitization page - ball is in my side, I have some modifs to do and we should be good :) [16:21:58] Analytics-Cluster: Logstash is not working - https://phabricator.wikimedia.org/T86065#2134273 (Ottomata) Open>Invalid stale? 
[16:22:05] cool joal [16:22:06] * halfak loves "a-team" [16:22:12] Both the team and the shortened name :) [16:23:26] Analytics-Cluster: Epic: AnalyticsEng has kafkatee running in lieu of varnishcsa and udp2log - https://phabricator.wikimedia.org/T70139#2134275 (Ottomata) Open>Resolved a:Ottomata [16:23:55] :) [16:25:28] ottomata: would like your opinion on load change [16:25:50] Analytics: Prototype Data Pipeline on Druid - https://phabricator.wikimedia.org/T130258#2134283 (Nuria) [16:26:02] ottomata: I think I should remove completely the existing faulty_host thing and have a new one about data_loss [16:26:08] (or renaming, same) [16:26:19] Do you agree with that? [16:26:20] +1 [16:26:24] awesome [16:26:48] i think the faulty host was originally for some extra monitring stuff, so we could use it to generate icinga alerts for specific hosts (not sure about that) but we have abandoned that route [16:27:25] (CR) Ottomata: [C: 1] "Have only done a quick read through, but looks fine to me." [analytics/refinery] - https://gerrit.wikimedia.org/r/277507 (https://phabricator.wikimedia.org/T129519) (owner: Joal) [16:27:44] (Abandoned) Ottomata: Add pom.xml to parent ua-parser directory [analytics/ua-parser] - https://gerrit.wikimedia.org/r/277335 (owner: Madhuvishy) [16:33:35] !log restarting eventlogging to remove server-side-raw forwarder [16:38:25] ottomata: elukey and I are talking about testing namenodes changes on labs - since he has to apply his puppet patch to multiple nodes and the ones in analytics project should be made self-hosted for applying the patch - I am suggesting he uses the beta cluster analytics nodes [16:38:33] so as to cherry pick [16:39:01] yeah probably it is better as madhuvishy suggests [16:39:42] Analytics-Tech-community-metrics, DevRel-March-2016: Many profiles on profile.html do not display identity's name though data is available - https://phabricator.wikimedia.org/T117871#2134343 (Lcanasdiaz) >>! 
In T117871#2119548, @Aklapper wrote: > @Lcanasdiaz: Hmm. > Going to http://korma.wmflabs.org/brows... [16:40:18] yeah i htink that is easier [16:40:23] i mean, we can make them self hosted [16:40:23] Analytics-Tech-community-metrics, DevRel-March-2016: Make GrimoireLib display *one* consistent name for one user - https://phabricator.wikimedia.org/T118169#2134347 (Lcanasdiaz) Resolved>Open [16:40:25] Analytics-Tech-community-metrics, Developer-Relations, DevRel-March-2016: Who are the top 50 independent contributors and what do they need from the WMF? - https://phabricator.wikimedia.org/T85600#2134350 (Lcanasdiaz) [16:40:27] Analytics-Tech-community-metrics, DevRel-March-2016: Many profiles on profile.html do not display identity's name though data is available - https://phabricator.wikimedia.org/T117871#2134349 (Lcanasdiaz) [16:40:29] when i was testing the cdh5.5 upgrade [16:40:33] i made a cluster that was self hosted [16:40:35] and tested that way [16:40:37] after it was merged [16:40:44] i deleted that cluster, and then made another that was not self hosted [16:40:56] but, ja, i think easier to test in beta [16:41:00] for this change [16:41:32] Analytics-Tech-community-metrics, DevRel-March-2016: Make GrimoireLib display *one* consistent name for one user - https://phabricator.wikimedia.org/T118169#1793275 (Lcanasdiaz) @Aklapper found a new bug related to this topic. I'm working on it. > Going to http://korma.wmflabs.org/browser/mls.html under... [16:42:25] ottomata: yeah - that's what I thought too [16:42:40] because its more than 1 node [16:42:50] ottomata: thanks for merging the server side forwarder changes [16:43:44] sure! madhuvishy here's another quick one [16:43:45] https://gerrit.wikimedia.org/r/#/c/278303/ [16:45:25] ottomata: merged [16:45:32] danke [16:45:45] Analytics-Tech-community-metrics, DevRel-March-2016: Make GrimoireLib display *one* consistent name for one user - https://phabricator.wikimedia.org/T118169#1793275 (Lcanasdiaz) >>! 
In T118169#2130570, @Aklapper wrote: > Looking at http://korma.wmflabs.org/browser/top-contributors.html I see e.g. "Catrope... [16:47:35] Analytics-Kanban: Integrate new browser visualization into wikistats - https://phabricator.wikimedia.org/T129101#2134417 (Nuria) I have sent an e-mail to Erik Z. on this regard, let's see if he has a suggestion regarding how to proceed. [16:58:48] ottomata: what about spinning up a self hosted puppet master and change the /etc/puppetlabs/puppetserver/conf.d/puppet.conf file on a101/a102 to point to it? [16:59:04] temporarily [16:59:17] elukey: if you can make it work, that'd be fine i think [16:59:27] but, i think you'd spend less time just cherry picking in beta [16:59:43] ah yes but I don't want to make a mess [16:59:50] naww you won't, it'll be fine [17:00:12] no one really uses either of those clusters, and we can always make a new one if we have to [17:02:19] yes elukey have fun and don't worry about breaking our cluster in labs or beta! [17:03:51] I was more worried about messing up the puppet master but I'll do it :) [17:06:43] nuria, can I send you the mediawiki bot convention email draft to you for review? [17:11:58] mforns: you can cc me too if you want [17:12:21] I'm back in action btw [17:12:24] milimetric, thanks, I thought you were solving new place's stuff [17:12:30] ok [17:12:54] yep, there was a monster pipe completely rusted and cracked I had to help replace. I think we're good now :) [17:13:02] jdlrobson: we can hang out and hack if you're up for it [17:14:56] hey milimetric i'm still warming up this morning (on my coffee) [17:15:07] i've got a few reading web stuff to review but yeh let's hang out when that's all done :) [17:15:08] all good, whenever, I'm around all night [17:16:37] milimetric, cool, exciting! [17:18:39] hey nuria. question: is all site browser reports for Wikipedia or for other projects, too? [17:19:11] ottomata, elukey, those burrow alerts, I can't see anything wrong... 
[17:19:41] mforns: sure, send it along [17:19:49] nuria, I sent it already :] [17:20:04] leila: it's all projects that form the pageview definition [17:20:28] perfect. thanks, nuria. [17:20:37] leila: as such since enwiki it's the one with the biggest # of pageviews is the project with the biggest influence on numbers [17:20:52] yup yup, nuria. :) [17:21:06] mforns: I was about to ping you [17:21:17] elukey, aha [17:21:38] leila: but it includes all projects listed here: https://github.com/wikimedia/analytics-refinery/blob/master/static_data/pageview/whitelist/whitelist.tsv [17:21:39] the first one was about consumers progressing too slow, the last one is a bit more serious [17:21:41] madhuvishy: ahhh [17:21:44] STALL [17:21:45] they are because i stopped the forwarder [17:21:46] on it [17:21:54] ah ok! [17:22:02] buuuuuuuuuuuuu [17:22:12] ottomata: you don't care about your worried colleagues [17:22:16] xD [17:22:18] :P [17:22:19] leila: we are working on documenting the data , there is a new table in hadoop that holds this data [17:22:20] halfak: I cc-ed you on https://phabricator.wikimedia.org/T130256 because it's what we've been dreaming on for FOREVER :D [17:22:25] browser_something [17:22:27] HM [17:22:48] so halfak, sharpen those steely knives, we're gonna kill the beast next quarter [17:23:03] * milimetric is not sure if he understands that song [17:23:22] AH, hm [17:23:23] h [17:23:29] i'm not sure if I can get rid of those alerts [17:23:32] i think we just have to wait for it to go stale [17:23:42] madhuvishy: hm, we should note this in the future [17:23:57] different el consumers should all use different consumer groups, even if they consume different topics [17:25:03] we used the same consumer group name for multiple services, and burrow monitoring is based on a single consumer group [17:25:12] so i can't tell it to stop those emails without turning off monitoring for the other ones [17:25:53] nuria: , mforns [17:25:58] can you confirm that you don't 
need full sudo? [17:25:58] https://phabricator.wikimedia.org/T130226 [17:25:59] do you? [17:26:29] I don't think so [17:27:23] but I'm not sure [17:30:09] ottomata: righttt [17:31:06] ottomata: full sudo no [17:31:20] ottomata: but i bet we will need to modify wikistats files that are owned by root [17:35:11] we can change that [17:35:15] they should be owned by stats [17:36:32] ottomata: ok, then whatever permits you think are good will do, me no want to have no permits [17:36:40] more permits== more chances to screw up [17:39:35] milimetric, \o/ [17:39:58] Will be very nice to start moving some of my PROCESS THE ENTIRE REVISION TABLE ACROSS WIKIS style code to hive/spark. [17:40:57] hey folks, i'm struggling with logstash. can y'all help? i'm not even sure if what i'm looking for is in there… [17:41:23] MatmaRex: what are you looking for? [17:41:37] bd808: outputs of this wfDebug() call: https://github.com/wikimedia/mediawiki/blob/master/includes/api/ApiUpload.php#L338 [17:41:53] wfDebug isn't in logstash [17:42:00] or on fluorine [17:42:14] wfDebug goest to /dev/null in prod [17:42:24] hm, okay. [17:42:26] bye a-team, have a nice weekend! [17:42:33] ciaoooooo [17:43:23] MatmaRex: to end up on disk/in logstash you need wfDebugLog with a channel name and then to have that channel enabled in InitializeSettings [17:43:44] PROBLEM - piwik.wikimedia.org on bohrium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:45:04] milimetric: yt? [17:45:28] Bye mforns ! [17:45:40] bye :] [17:47:01] hey nuria what's up [17:47:18] milimetric: how can I know the user/pw to mysql for piwik machine [17:48:06] oh right, I think I just sudo into it lemme check [17:48:41] I am running puppet btw [17:48:50] on bohrium? [17:48:55] you update something? [17:49:49] milimetric, nuria: is the icinga alert due to your work? [17:49:58] nuria: not sure how to login to sql there [17:50:03] elukey: not me, I'm just reading email [17:50:12] elukey: waht icinga?
[17:50:19] piwik is dow down [17:50:19] PROBLEM - piwik.wikimedia.org on bohrium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:50:26] ahhh okok [17:50:35] someone ran links from /etc/apache2/sites-enabled [17:50:41] which created /etc/apache2/sites-enabled/.links2 [17:50:42] i might have woken it up... [17:50:55] which puppet then purged, triggering a refresh of the apache2 service [17:51:09] Mar 18 17:37:14 bohrium puppet-agent[1379]: (/Stage[main]/Apache/File[/etc/apache2/sites-enabled/.links2]/ensure) removed [17:51:09] Mar 18 17:37:14 bohrium puppet-agent[1379]: (/etc/apache2/sites-enabled) Scheduling refresh of Service[apache2] [17:51:10] Mar 18 17:37:22 bohrium systemd[1]: Reloading LSB: Apache2 web server. [17:51:56] also, umm [17:51:58] Mar 18 17:35:58 bohrium /etc/mysql/debian-start[1197]: WARNING: mysql.user contains 3 root accounts without password! [17:52:00] ??? [17:52:31] ori: I didn't get if piwik.w.org was down on purpose or not [17:52:32] :D [17:52:37] (sorry missing context) [17:52:51] not on purpose, i just ssh'd in to try and figure out why it stopped responding [17:52:58] and saw what i pasted above in syslog [17:53:10] someone ran links from /etc/apache2/sites-enabled -> which created /etc/apache2/sites-enabled/.links2 -> which puppet then purged, triggering a refresh of the apache2 service [17:55:05] actually, it's worse than that [17:55:15] puppet is not expecting to find directories there, so it is not removing it [17:55:20] but it is still refreshing the service [17:55:24] so that is happening on every puppet run [17:55:31] i'm going to rm -f it [17:56:41] ori: ok [17:56:59] ori: it is also friend I think because the number of requests for Ios app [17:58:16] the .links2 thing was only happening for the last two puppet runs, so fortunately it is not a longstanding issue [18:06:29] a-team I am logging off, have a good weekend! [18:06:41] night elukey! happy weekend :) [18:06:53] Bonn weekend elukey :) [18:08:11] laters! 
[18:16:48] ottomata: how do I manually upload a jar to archiva? [18:17:33] https://wikitech.wikimedia.org/wiki/Archiva#Uploading_Dependency_Artifacts [18:17:41] do like is says there [18:17:44] but not in mirroerd repo [18:17:47] use release [18:19:04] ottomata: cool thanks [18:19:24] I was able to delete wmf3 easy [18:19:34] should i re-release wmf3 or bump it up? [18:19:50] joal ^ too if you're around? [18:20:32] madhuvishy: I'd rather use a new one, to prevent any change-issue related, but we can try replacing if you prefer [18:21:00] no preference [18:21:27] joal: since it's not used anywhere - i feel like it's safe - i'll upload wmf3 - import in refinery - if any trouble - can bump it up? [18:21:53] sure, uour call :) [18:25:19] Analytics-Kanban: Modify oozie Webrequest-Load job not to fail in case of minor statistics error - https://phabricator.wikimedia.org/T130187#2134786 (JAllemandou) a:JAllemandou [18:28:09] (PS1) Joal: [WIP] Change webrequest oozie load job to fail in data loss case [analytics/refinery] - https://gerrit.wikimedia.org/r/278325 (https://phabricator.wikimedia.org/T130187) [18:28:29] a-team, gone for the weekend ! [18:28:35] See you on monday :) [18:30:32] Analytics, Analytics-EventLogging: Upgrade eventlogging servers to Jessie - https://phabricator.wikimedia.org/T114199#2134818 (Ottomata) [18:36:09] ottomata: should we patch puppet with different consumer group names and restart it all? 
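(Editor's note on the consumer-group question above.) The earlier discussion says Burrow monitoring is keyed on the consumer group, so one stopped consumer keeps the whole shared group alerting. The following is only a toy model of that reasoning, not Burrow's API: the function name, the group names, and the `eventlogging-client-side` topic are assumptions made up for illustration.

```python
from collections import defaultdict


def groups_alerting(consumers):
    """Return the sorted names of groups that have at least one stalled topic.

    `consumers` is an iterable of (group, topic, stalled) tuples. This is a
    toy model of per-group monitoring, not how Burrow is implemented.
    """
    stalled_by_group = defaultdict(bool)
    for group, _topic, stalled in consumers:
        stalled_by_group[group] = stalled_by_group[group] or stalled
    return sorted(g for g, s in stalled_by_group.items() if s)


# One shared group (hypothetical name): stopping the server-side consumer
# makes the whole group look unhealthy.
shared = [
    ("eventlogging", "eventlogging-server-side", True),
    ("eventlogging", "eventlogging-client-side", False),
]

# One group per consumer (hypothetical names): only the stopped one alerts,
# so its monitoring can be dropped without touching the others.
split = [
    ("el-server-side", "eventlogging-server-side", True),
    ("el-client-side", "eventlogging-client-side", False),
]

print(groups_alerting(shared))
print(groups_alerting(split))
```

With separate groups, the alert for the retired consumer could be silenced on its own, which is the motivation given in the chat for renaming them.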
[18:38:26] madhuvishy: would be a good idea [18:38:29] should make a task to plan that [18:38:38] mostly will be fine, buuuut, it means that the consumer offsets will be reset [18:38:39] i dunno [18:38:44] Analytics-Cluster, Discovery, Maps, RESTBase-Cassandra: Create separate Kibana dashboards for production Cassandra clusters - https://phabricator.wikimedia.org/T130393#2134870 (Eevans) [18:38:46] ottomata: ah yes [18:38:48] hmmm [18:38:49] maybe we should just deal and remember when it is convenient to change them [18:40:01] will we keep getting burrow emails until then? [18:43:34] i think it will eventually be removed, no? [18:43:38] from the list of consumers? [18:43:39] not sure... [18:43:44] maybe we could delelte the znode? [18:45:17] hmmm [18:46:47] nuria: the secret way of accessing root without a pw on bohrium: [18:46:48] sudo mysql --defaults-file=/etc/mysql/debian.cnf [18:48:10] milimetric: can you look into this, seen in bohrium's syslog: Mar 18 17:35:58 bohrium /etc/mysql/debian-start[1197]: WARNING: mysql.user contains 3 root accounts without password! [18:48:27] what are the three, and what are the best practices there? it'd be good to ask jynus if passwordless root accounts are a good idea [18:48:27] yeah i saw that, was gonna look after this meeting [18:48:35] cool cool [18:48:37] thanks [18:52:53] Analytics: Operational improvements and maintenance in Eventlogging in Q4 {oryx} - https://phabricator.wikimedia.org/T130247#2134922 (Neil_P._Quinn_WMF) [18:53:27] Analytics: Operational improvements and maintenance in EventLogging in Q4 {oryx} - https://phabricator.wikimedia.org/T130247#2134924 (ori) [19:04:54] ori: since piwik was down due to load does it seem ok to implement rate limit at apache level to make sure service stays up under high load? [19:05:27] nuria: maybe; it depends on the resource that was saturated [19:05:31] what exactly happened? 
[19:05:52] ori: so those three "users" are just root@bohrium, root@127.0.0.1, and root@::1, as opposed to root@localhost [19:05:57] (which has a password) [19:06:09] ori: cpu was at 100% and after that nothing worked well but did not look further than that. [19:07:15] nuria / ori: I'm feeling bad that we're spending so much time on this. The whole deal with piwik was that they thought it could handle their workload with a minimal amount of ops. We have solutions that work for them, they just have to decide now, so we shouldn't spend too much more time thinking about it. Piwik's performance was a known going into this [19:07:37] we found that it's about 10 times better than they claim on their site, but still limited to no more than around 2 million insertions a day [19:07:49] Analytics-Kanban, Patch-For-Review: Update UA-Parser with latest definitions - https://phabricator.wikimedia.org/T129672#2112410 (madhuvishy) Uploaded new artifact - https://archiva.wikimedia.org/#artifact/ua_parser/ua-parser/1.3.0-wmf3 [19:08:00] so anyone who goes over that, needs to adjust, and if they can't adjust we have to adjust for them [19:08:01] milimetric: OTOH, there is a whole class of one-off requests for which it is applicable as a generic solution [19:08:13] totally, it's great, but not for high traffic [19:08:30] if something has 2 million hits a day it should be properly instrumented with eventlogging to pull raw data if it needs, or sampled [19:08:58] milimetric: i do not think we have spent a lot of time though, i think is worth knowing the limits so we know how many clients/use cases we can sustain [19:09:09] it's running on a VM with two virtual cores [19:09:21] app servers have 32, for comparison [19:09:30] i think a modest hardware upgrade would go a long way [19:09:41] right, I mean everything it's doing is purely CPU bound, the memory use is low and disk is good [19:09:43] ori: can we do one easily? 
[19:10:02] ori: but let's not kid ourselves, this is not made for millions of requests per day [19:10:18] yeah, going to 4 cores would probably keep them out of trouble for the medium future, but the main point is they don't actually need raw data [19:10:35] i'd also try out php7 or hhvm [19:10:46] 'perf top' shows most cpu time is spent in php [19:11:01] I think we can upgrade hardware sure, but let's have in mind that every tenant of the box cannot send unbounded requests [19:11:02] milimetric: yeah, you might be right [19:11:12] either way I think it's a little too close to overload for comfort [19:11:16] Analytics-Kanban: Update UA-Parser version to 1.3.0-wmf3 in refinery-source - https://phabricator.wikimedia.org/T130399#2135021 (madhuvishy) [19:11:20] so I think expanding capacity might be in order anyhow [19:11:25] but what you say sounds appropriate too [19:11:25] ori: how can we upgrade the box? [19:11:31] (PS1) Madhuvishy: Upgrade to latest UA-Parser version [analytics/refinery/source] - https://gerrit.wikimedia.org/r/278337 (https://phabricator.wikimedia.org/T130399) [19:11:43] nuria: I can just do it; hang on [19:11:55] I mean, if we want bare-metal server with lots of cores, that's a different matter [19:12:02] but I can increase the number of virtual cores pretty easily [19:12:32] ori: really?! PHP? There's some tweaks we can do for that but that doesn't match what I was seeing just watching it with htop, mysql was by far using the most CPU, the top 20 processes were almost all mysql [19:14:12] i did not observe it over time, so it could have been a fluke [19:14:46] Analytics-Cluster, Discovery, Maps, RESTBase-Cassandra, Patch-For-Review: Create separate Kibana dashboards for production Cassandra clusters - https://phabricator.wikimedia.org/T130393#2135049 (Eevans) I've queued up https://gerrit.wikimedia.org/r/278330 for [[ https://wikitech.wikimedia.org/... 
[19:15:30] ori: ya, let's get more cores if it is easy [19:20:09] nuria: done; can I reboot bohrium? [19:20:13] necessary for it to take effect [19:20:23] ori: yes [19:20:30] ok, doing [19:22:34] thx ori, yeah, i watched it on and off for a few hours, I think mysql was the most overworked (it makes sense because it's seeing 5 million inserts and only a few people are accessing the actual UI) [19:22:47] but it's true that while the UI renders the PHP processes do spike [19:23:35] OK, upgraded bohrium from 2 vCPUs to 8, and from 4G ram to 8 [19:23:48] it's back up now [19:24:47] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=bohrium.eqiad.wmnet&m=cpu_report&s=descending&mc=2&g=cpu_report&c=Miscellaneous+eqiad [19:25:51] even nicer: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Miscellaneous+eqiad&h=bohrium.eqiad.wmnet&jr=&js=&v=12.9&m=cpu_user&vl=%25&ti=CPU+User [19:26:52] awww so nice [19:28:22] poor piwik. i'm happy for it [19:28:38] we should also convert the database to tokudb [19:28:39] http://piwik.org/faq/how-to-install/faq_20200/ [19:29:26] nuria did that recently for eventlogging! [19:29:57] ori: thank you! [19:30:04] np [19:31:04] i'm inclined to think that this should get a dedicated, bare-metal machine with SSDs [19:31:26] i know that you don't want to encourage uncontrolled use, but think of it in terms of your peace of mind [19:31:40] hah, speak of the devil [19:31:42] PROBLEM - piwik.wikimedia.org on bohrium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:19] maybe that's a delayed alert from the reboot earlier? [19:32:29] ya that's what I'd guess [19:33:09] ori: getting a new machine doesn't mean we can not "bound" usage [19:33:18] right [19:33:45] we need to know how much data it can handle easily and their self-reported limit is 50 million per month [19:34:14] ad most [19:34:35] opcache is not enabled on bohrium either [19:34:47] ori: opcache? [19:35:06] ori: ah like apc? 
[19:35:09] yeah [19:35:15] ori: that one i have not used before [19:36:00] ori: but thst should be easy to enable via puppet setup of php [19:36:05] i can look into that [19:36:35] too late [19:36:47] https://gerrit.wikimedia.org/r/278341 [19:37:39] ;) [19:37:54] it'll refresh apache2 now so there may be another brief alert [19:38:49] jajaja [19:39:27] ori; well, now i know how to manage php init on puppet which i didn't know before [19:40:33] ottomata: are you guys doing work on eventlogging and that is why we have the burrow alerts? [19:42:11] ok gotta run [19:42:16] bohrium should have a lot more headroom [19:42:21] thx! [19:42:28] np, bye! [19:42:30] have a good weekend [19:42:33] you too [19:42:57] nuria: no - we shut down the consumer from the server-side topic but all the consumers have the same consumer group [19:43:05] so it keeps sending alerts [19:43:18] madhuvishy: waitt... [19:43:38] it is sending alerts because server side is not being consumed? [19:43:56] ^ madhuvishy [19:44:00] STALL eventlogging-server-side:0 (1458317811043, 361024567, 0) -> (1458318803352, 361024567, 0) [19:44:01] yes [19:44:18] even though it's different topics and consumer processes [19:44:26] they belong to the same consumer group [19:44:47] ottomata and I have been talking about eventually splitting out the consumer group names [19:45:06] but it'd need a restart and consumer offsets will be reset [19:45:34] i think eventually burrow will figure out this topic is not being consumed from anymore and stop sending emails [19:45:39] not sure though [19:46:47] madhuvishy: HOPEFULLY it will figure it out [19:46:53] yes [19:47:05] madhuvishy: let me look at burrows docs [19:48:22] madhuvishy: "The lagcheck expire-group configuration specifies when a consumer should be considered to have gone away permanently. [19:48:56] ah [19:49:05] yeah so it'll figure it out [19:49:40] 604800 [19:49:48] is the expire-group setting [19:49:51] a week [19:49:56] right? 
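(Editor's note.) A quick sanity check of the figure quoted above, as a sketch only: 604800 seconds is exactly seven days. The constant and helper names are made up for illustration and are not taken from Burrow's configuration.

```python
from datetime import timedelta

# Value quoted in the chat as the current expire-group setting, in seconds.
EXPIRE_GROUP_SECONDS = 604800


def describe(seconds: int) -> str:
    """Render a number of seconds as whole days, hours, and minutes."""
    delta = timedelta(seconds=seconds)
    hours, remainder = divmod(delta.seconds, 3600)
    minutes = remainder // 60
    return f"{delta.days}d {hours}h {minutes}m"


print(describe(EXPIRE_GROUP_SECONDS))  # 7d 0h 0m
```

The same helper can be used to double-check any shorter temporary value before editing the setting by hand.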
[19:49:59] some default [19:50:04] yes [19:50:07] in a week [19:50:12] ha ha [19:50:16] we can change it to a day [19:50:21] ottomata: ^^ [19:50:25] ya, i think so [19:50:41] to save electricity from e-mails being sent [19:51:00] :D [19:51:09] haha, could change it to 30 minutes for a bit, and then change it back [19:51:14] ottomata: yeah [19:51:17] i will just disable puppet real quikc and do that [19:51:22] even otherwise, 1 day may be okay? [19:51:26] ottomata: cool [19:52:05] naw, that would mean after a day of EL being down it would stop monitoring [19:52:22] right [19:52:24] ottomata: indeed [19:52:36] hopefully EL wouldn't be down for a day! [19:53:13] ok, i set it to 10 minutes [19:53:26] will wait 60+ minutes to see if we stop getting email [19:53:49] okay [19:54:06] it sends it every 30 minutes now [19:54:59] ja [19:55:09] just set an alarm to remind me to turn puppet back on! [19:56:04] :) [19:56:17] * madhuvishy needs to eat [19:56:43] ottomata: you may find https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/files/home/ori/.binned/disable-puppet useful, i use it all the time [19:56:55] 'disable-puppet 24h' will disable puppet for 24h, etc. [19:58:21] nice! [19:59:01] oh ha! [20:00:46] Analytics: Browser reports improvements (parent task) - https://phabricator.wikimedia.org/T130405#2135158 (Nuria) [20:01:32] (CR) Milimetric: [C: 2 V: 2] "This is good but once we're set with all the yamls on our end and before we deploy we should mirror them in the front-end restbase." [analytics/aqs] - https://gerrit.wikimedia.org/r/277740 (owner: Joal) [20:03:26] (CR) Milimetric: [C: 2 V: 2] "sweet" [analytics/aqs] - https://gerrit.wikimedia.org/r/277781 (owner: Joal) [20:03:28] Analytics: Add (and default to) a breakdown in percentages also for the line chart. - https://phabricator.wikimedia.org/T130406#2135172 (Nuria) [20:08:54] milimetric, nuria: still there? 
[20:09:04] hey ori what's up
[20:09:08] ori: yes, in a meeting in a bit
[20:09:09] (I'll be here late today)
[20:09:15] extra cpu and ram are nice, but not the main issue
[20:09:17] disk is saturated
[20:09:31] Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
[20:09:31] vda 0.00 2.40 14.40 165.00 300.00 5906.40 69.19 2.59 14.38 37.78 12.33 5.54 99.36
[20:09:36] oh it was good when i looked, i guess at that rate though it's bound to happen
[20:09:37] and that is because mysql is doing filesorts
[20:09:59] oh i see, not full, saturated
[20:10:13] yeah
[20:10:23] needs a bare-metal machine with ssd
[20:10:49] wait, in terms of space?
[20:10:53] https://www.irccloud.com/pastebin/QJurx7dk/
[20:11:05] ^ ori
[20:11:09] no, I/O
[20:11:15] oh i see, not full, saturated
[20:11:18] that ^ :)
[20:12:03] makes sense, we could probably tweak mysql like Corey was suggesting, increase some buffer sizes that'll make it use more memory
[20:12:16] maybe ease up on the disk, or I'm sure there are other ways we can tune
[20:12:46] but I think that's the kind of work we do on eventlogging
[20:12:52] doesn't make sense to do it twice
[20:12:55] it's futile, you won't get good disk throughput on a VM
[20:13:05] needs a bare metal machine with an SSD
[20:13:39] k, makes sense
[20:13:56] coreyfloyd: this might be of interest ^
[20:14:16] (ori's been trying to save the piwik instance)
[20:15:11] (PS1) BryanDavis: Update mediawiki/event-schemas submodule [analytics/refinery/source] - https://gerrit.wikimedia.org/r/278346 (https://phabricator.wikimedia.org/T108618)
[20:54:40] Analytics: Browser reports improvements (parent task) - https://phabricator.wikimedia.org/T130405#2135388 (Krinkle)
[20:54:45] Analytics-Cluster: Story: Community has periodic browser stats report generated from Hadoop data - https://phabricator.wikimedia.org/T69053#2135387 (Krinkle)
[20:55:38] Analytics: Browser report has odd "Not named" labels - https://phabricator.wikimedia.org/T130415#2135407 (Krinkle)
[21:04:36] milimetric: thanks - catching up - traveling today, but will try to review everything for when we are all back Tuesday.
[21:06:30] are y'all responsible for piwik?
[21:06:31] 21:05 < icinga-wm> PROBLEM - piwik.wikimedia.org on bohrium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:08:13] greg-g: yes - thank you. I think folks are looking at it ^^
[21:11:09] Analytics: Get jenkins to automate releases {hawk} - https://phabricator.wikimedia.org/T130122#2135440 (madhuvishy)
[21:11:46] Analytics: Get jenkins to update refinery with deploy of new jars {hawk} - https://phabricator.wikimedia.org/T130123#2135441 (madhuvishy)
[21:46:46] (PS1) Jdlrobson: WIP: Allow filtering of data breakdowns [analytics/dashiki] - https://gerrit.wikimedia.org/r/278395
[21:49:25] Analytics: Make metrics-by-project breakdown interactive and bookmarkable - https://phabricator.wikimedia.org/T130255#2135530 (Milimetric) good start by @Jdlrobson here: https://gerrit.wikimedia.org/r/#/c/278395/
[22:36:23] Analytics-EventLogging, MediaWiki-extensions-MultimediaViewer: 60% of MultimediaViewerNetworkPerformance events dropped (exceeds maxUrlSize) - https://phabricator.wikimedia.org/T113364#2135707 (Krinkle) Open>Resolved a:Krinkle
[23:41:58] Analytics, Wikipedia-iOS-App-Product-Backlog: Fix iOS uniques in mobile_apps_uniques_daily after 5.0 launch - https://phabricator.wikimedia.org/T130432#2135848 (Tbayer)
[23:51:05] ha ha our jenkins jobs are called analytics-kraken