[05:49:14] Analytics-Kanban, Datasets-Webstatscollector, RESTBase-Cassandra: Better response times on AQS (Pageview API mostly) {melc} - https://phabricator.wikimedia.org/T124314#1952692 (Nemo_bis) Is this really unrelated from T125345?
[05:53:52] Analytics-Kanban, Datasets-Webstatscollector, RESTBase-Cassandra: Better response times on AQS (Pageview API mostly) {melc} - https://phabricator.wikimedia.org/T124314#2283885 (Nemo_bis)
[05:53:55] Analytics, Datasets-General-or-Unknown, Services: Many error 500 from pageviews API "Error in Cassandra table storage backend" - https://phabricator.wikimedia.org/T125345#2283884 (Nemo_bis)
[06:10:04] Analytics, Pageviews-API: Invalid API input returns 404 instead of 500 or 400 - https://phabricator.wikimedia.org/T134964#2283907 (Nemo_bis)
[06:10:58] Analytics, Datasets-General-or-Unknown, Services: Many error 500 from pageviews API "Error in Cassandra table storage backend" - https://phabricator.wikimedia.org/T125345#2283909 (Nemo_bis)
[09:31:02] Analytics, Pageviews-API: Invalid API input returns 404 instead of 500 or 400 - https://phabricator.wikimedia.org/T134964#2283895 (Krenair) It's not an Internal Server Error (500) if you give bad input.
[10:54:15] !log eventbus restarted on kafka100[12] for security upgrades
[11:00:36] elukey: http://www.scylladb.com/
[11:00:57] elukey: Let's ask the team later on if it's worth having a look or not
[11:13:43] scylla not prod material yet imho
[11:13:49] s/not/is no/
[11:13:53] *not
[11:13:54] grr
[11:13:55] :)
[11:13:58] :)
[11:14:34] joal: https://phabricator.wikimedia.org/T125368
[11:16:31] mobrovac: thanks for the link :)
[11:16:59] joal: I'm leaving now to go co-work with ottomata. It should give me plenty of time to make it there by 09:00 my time, but just in case I'm late, that's what's going on
[11:17:11] np milimetric :)
[11:17:14] Later !
[11:19:47] "Currently DTCS support is still weak" --> I would have some words about the one implemented in Cassandra too :P
[11:19:50] joal --^
[11:19:58] elukey: :D
[11:20:57] elukey: given that we're in "manual test mode" (kind of), maybe we could have a go and try it ?
[11:21:35] joal: yep once we have puppet set up for basic multi instance we can do whatever test we want :D
[11:22:06] this morning Alex solved a problem with the DHCP/VLAN configuration, so now I am able to do the PXE boot.. but still working on partman
[11:22:07] elukey: I wonder about 10x throughput and latency reduction, want to see it with my own eyes ;)
[11:22:19] 10x??
[11:22:30] great elukey (you didn't follow ottomata's advice to drop it ;)
[11:23:06] not yet, I wanted to do it manually (this is why I discovered the DHCP issue this morning) but it is not that straightforward either :P
[11:23:23] I think that Andrew did it partially in partman and the rest via script once imaged
[11:23:27] hehehe :)
[11:23:31] ok
[11:23:55] (brb)
[11:36:52] ottomata: From camus / hadoop, your kafka upgrade went ok :)
[11:37:03] ottomata: HIIII ! Sorry :)
[11:37:09] ahahahha
[11:42:23] Analytics, Beta-Cluster-Infrastructure: deployment-aqs01.deployment-prep.eqiad.wmflabs doesn't respond to ssh / hung process - https://phabricator.wikimedia.org/T134981#2284433 (hashar)
[11:46:14] Analytics, Beta-Cluster-Infrastructure: deployment-aqs01.deployment-prep.eqiad.wmflabs doesn't respond to ssh / hung process - https://phabricator.wikimedia.org/T134981#2284441 (hashar) It is back. Puppet is lagged out: The last Puppet run was at Sat May 7 04:51:29 UTC 2016 (6174 minutes ago).
[11:48:11] Analytics, Beta-Cluster-Infrastructure: deployment-aqs01.deployment-prep.eqiad.wmflabs doesn't respond to ssh / hung process - https://phabricator.wikimedia.org/T134981#2284451 (hashar) Open>Resolved a:hashar Puppet log that auto started on instance boot: ``` Notice: /Stage[main]/Scap/Package...
[11:56:10] Analytics, Community-Tech, Pageviews-API, Tool-Labs-tools-Other, and 2 others: Pageview Stats tool - https://phabricator.wikimedia.org/T120497#2284462 (Izno)
[12:12:22] Analytics-Kanban: Augment oozie load SLA + Add URL to oozei error messages - https://phabricator.wikimedia.org/T134876#2284498 (JAllemandou) a:JAllemandou
[12:56:12] joal: ok, I made it on time!
[12:56:23] well done milimetric :)
[13:00:08] milimetric, mforns: Batcave !
[13:00:24] i'm dere
[13:00:50] moorninnng
[13:01:34] o/
[13:01:41] hi ottomata :)
[13:01:43] joal: https://gerrit.wikimedia.org/r/#/c/288184/2/modules/install_server/files/autoinstall/partman/aqs-cassandra-8ssd-2srv.cfg
[13:01:55] but I am far from sure that it'll work :P
[13:02:20] brutal raid 0 without LVM
[13:02:34] * joal has tried to read elukey's code, and now has eyes crying blood
[13:02:48] * joal doesn't want to read partman EVER again :)
[13:02:48] * elukey hugs joal
[13:03:13] * joal hugs elukey back in compassion
[13:05:05] elukey: will be ready to go with kafka in just a few mins
[13:05:35] * elukey cheers for ottomata
[13:08:22] I'm impressed that you actually read it, joal, last time I tried to read partman my brain just went "nope"
[13:08:47] :)
[13:10:34] elukey: , am disabling alerts and camus
[13:11:07] ottomata: batcave?
[13:11:18] k!
[13:11:35] ottomata, elukey: following from IRC, ping if any need
[13:16:33] (CR) Milimetric: "one improvement suggestion, otherwise good." (1 comment) [analytics/dashiki] - https://gerrit.wikimedia.org/r/288104 (https://phabricator.wikimedia.org/T122533) (owner: Mforns)
[13:18:21] Analytics, Revision-Slider, TCB-Team, Patch-For-Review: Data need: User Behaviour when comparing article revisions - https://phabricator.wikimedia.org/T134861#2284978 (Addshore) The above change would provide a JSON log of data that we could then work with. A blob would be logged every time the d...
[13:20:08] joal, milimetric, I arrived
[13:20:13] are you guys in the meeting?
[13:20:16] yay, to the batcave!
[13:20:24] milimetric, the batcave is busy
[13:20:46] k, to the second! a-batcave-2
[13:20:51] ok
[13:22:33] joal: a-batcave-2?
[13:30:03] Sure mili
[13:30:14] milimetric, mforns : Yay ! batcave 2
[13:31:32] !log camus + puppet disabled on analytics1027
[13:45:36] joal: interesting thing that we noticed https://grafana.wikimedia.org/dashboard/db/kafka?panelId=32&fullscreen
[13:46:13] the upgraded instances seem to have fewer established TCP connections..
[13:48:29] ja m!
[13:48:39] ottomata: I am going to snapshot netstat -tuap before/after on kafka1013
[13:48:50] I am really curious
[13:49:10] ok
[13:49:20] lemme know when i can proceed
[13:50:42] ottomata: already done, you can go :)
[13:50:44] k
[13:50:45] going
[14:05:11] ottomata: from what I can see there are ~100 fewer established TCP connections with cpXXXX hosts
[14:05:55] hm
[14:06:12] just a bit less among kafka brokers
[14:06:53] same thing for mw
[14:07:21] so it seems to be a general reduction in the number of TCP connections opened
[14:08:57] ja doesn't seem to be a drop in messages produced
[14:09:02] so i guess it's a good thing
[14:09:27] +1
[14:09:32] maybe some sort of keep alive?
[14:12:14] ja maybe
[14:13:29] two brokers left
[14:19:50] 1 left
[14:19:50] !
[14:25:31] !!!
[14:29:14] ookeeey doke!
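The before/after `netstat -tuap` comparison described above could be sketched roughly as follows. This is only an illustration: it assumes the usual Linux net-tools column layout (proto, recv-q, send-q, local address, foreign address, state), and the function names, host names and sample output are made up for the example, not taken from kafka1013.

```python
from collections import Counter


def established_by_peer(netstat_output):
    """Count ESTABLISHED TCP connections per remote host in netstat output."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: proto recv-q send-q local foreign state [pid/program]
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "ESTABLISHED":
            continue
        # Drop the port from the foreign address, keep the host part.
        peer_host = fields[4].rsplit(":", 1)[0]
        counts[peer_host] += 1
    return counts


def connection_delta(before_output, after_output):
    """Return {host: change} for hosts whose connection count changed."""
    before = established_by_peer(before_output)
    after = established_by_peer(after_output)
    hosts = sorted(set(before) | set(after))
    return {h: after[h] - before[h] for h in hosts if after[h] != before[h]}


# Hypothetical snapshots, only to show the shape of the input.
BEFORE = """\
tcp 0 0 kafka-host:9092 cp-host-a:41000 ESTABLISHED 123/java
tcp 0 0 kafka-host:9092 cp-host-a:41001 ESTABLISHED 123/java
tcp 0 0 kafka-host:9092 mw-host-b:52000 ESTABLISHED 123/java
tcp 0 0 0.0.0.0:9092 0.0.0.0:* LISTEN 123/java
"""
AFTER = """\
tcp 0 0 kafka-host:9092 cp-host-a:41000 ESTABLISHED 123/java
tcp 0 0 kafka-host:9092 mw-host-b:52000 ESTABLISHED 123/java
tcp 0 0 0.0.0.0:9092 0.0.0.0:* LISTEN 123/java
"""

if __name__ == "__main__":
    print(connection_delta(BEFORE, AFTER))
```

Grouping by remote host rather than comparing raw totals makes it easier to see whether a drop is spread across all clients (cache hosts, app servers, other brokers) or concentrated in one group.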
[14:29:16] all brokers on 0.9
[14:29:38] waiting a few minutes, then will proceed with the final patch to remove 0.8 puppetization and change inter broker protocol version
[14:29:49] hm, maybe i'll see if can troubleshoot eventlogging along the way
[14:29:55] since we have to restart each broker again
[14:30:18] ottomata: please don't remove 0.8 yet :)
[14:30:25] let's do it tomorrow
[14:30:59] also I'd wait until tomorrow to switch the broker protocol version, or at least a couple of hours
[14:31:30] I am your -paranoid parameter I know
[14:31:45] buuuut better safe than sorry :P
[14:33:47] elukey: ottomata is walking around the house somewhere ;)
[14:35:08] madhuvishy: o/
[14:35:18] \o
[14:37:25] :)
[14:37:36] elukey: i'm not so sure we could switch back anyway
[14:37:38] dunno though
[14:37:43] but ja, i guess no hurry
[14:38:33] ottomata: ah really? I thought that the same process could have been applied for 0.9 -> 0.8
[14:38:42] anyhow, I'll let you decide, the migration looks awesome :)
[14:39:20] ha, your caution is appreciated! there's no reason not to wait, so we might as well
[14:43:52] ottomata: anything against merging https://gerrit.wikimedia.org/r/#/c/288184/4 ?
[14:44:02] Chris said that it looks good to test
[14:46:15] merge away!
[14:46:20] it def won't hurt anything
[14:46:23] worth a try
[14:54:02] Analytics-Kanban, Patch-For-Review: Client values inbound in X-analytics header (pageview and preview) are reflected in outbound X-Analytics on varnish - https://phabricator.wikimedia.org/T133204#2285389 (Nuria) Open>Resolved
[14:55:09] Analytics-Kanban, Datasets-Webstatscollector, RESTBase-Cassandra: Better response times on AQS (Pageview API mostly) {melc} - https://phabricator.wikimedia.org/T124314#2285392 (Nuria) @Nemo_bis : there are many issues among them two main ones: 1) we need new hardware and 2) we need to bound queries...
[14:56:21] (CR) Nuria: "Does this matter for maria DB in the case of a read-only query?" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/288133 (https://phabricator.wikimedia.org/T134950) (owner: Neil P. Quinn-WMF)
[14:58:03] Analytics-Cluster, Analytics-Kanban, Operations, Traffic, Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2285393 (Ottomata) All analytics brokers are now upgraded to confluent 0.9! Tomorrow we will switch off 0.8 inter broker...
[15:03:42] Analytics: Pageview API: Limit (and document) size of data you can request - https://phabricator.wikimedia.org/T134524#2268100 (GWicke) Limiting window sizes also provides an opportunity to introduce aligned time windows, which would greatly improve cache effectiveness. Examples: - Specific day: `/20160101/...
[15:24:03] ottomata: whenever you have time: https://github.com/wikimedia/mediawiki-event-schemas/pull/1
[15:24:19] hehe, joal they are hosted in gerrit :)
[15:24:30] arrgghhh :(
[15:24:35] Marf...
[15:24:44] * joal is gonna find them then
[15:38:32] elukey, ottomata : little hiccup in webrequest-load and burrow: kafka upgrade related?
[15:39:17] ya i think so
[15:39:27] joal: burrow is doing alright i think now
[15:39:37] should be fine, burrow is about eventlogging, we had to restart it after every broker restart :(
[15:40:10] ottomata: how come webrequest load issue? problems during restart?
[15:41:36] hello all! it looks like the Pageviews API now goes back to July 2015. Is this correct?
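The idea in T134524 above of limiting how much data a single request can cover, while keeping start and end dates, could look something like the sketch below. It is not the AQS implementation: the `YYYYMMDD` parameter format, the function name `check_date_range` and the 31-day cap are all assumptions chosen for illustration.

```python
from datetime import datetime

# Hypothetical cap; the task does not state an actual limit.
MAX_DAYS_PER_REQUEST = 31


def check_date_range(start, end, max_days=MAX_DAYS_PER_REQUEST):
    """Validate a YYYYMMDD start/end pair and return the number of days covered.

    Raises ValueError for malformed dates, reversed ranges, or ranges
    longer than max_days (both endpoints are counted as included).
    """
    start_day = datetime.strptime(start, "%Y%m%d")
    end_day = datetime.strptime(end, "%Y%m%d")
    if end_day < start_day:
        raise ValueError("end date is before start date")
    days = (end_day - start_day).days + 1
    if days > max_days:
        raise ValueError(
            "range covers %d days, limit is %d" % (days, max_days)
        )
    return days


if __name__ == "__main__":
    print(check_date_range("20160101", "20160131"))
```

Rejecting an oversized range up front, before any storage query is issued, is one way to keep very large per-article requests from reaching the backend at all.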
[15:41:38] joal: i think vk has a few produce errors during broker restarts
[15:41:48] oh
[15:41:49] actually
[15:42:03] it could be related to the vk crash on some hosts after ema and bblack restarted some vks
[15:42:10] they were saying it actually crashed for a bit
[15:42:12] that is more likely
[15:42:17] ottomata: those were for cache::misc
[15:42:17] ottomata: that would make more sense :)
[15:42:19] yeah
[15:42:48] alert email mentions cache-misc, so we have our culprit :)
[15:42:53] Thanks ottomata and elukey :)
[15:42:55] joal: goood
[15:42:59] Analytics-Kanban, Patch-For-Review: Fix Dashiki's metrics-by-project breakdown - https://phabricator.wikimedia.org/T133944#2285488 (Nuria) Open>Resolved
[15:45:14] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Puppetize and make useable confluent kafka packages - https://phabricator.wikimedia.org/T132631#2285493 (Nuria) Open>Resolved
[15:45:18] Analytics-Cluster, Analytics-Kanban, Operations, Traffic, Patch-For-Review: Upgrade analytics-eqiad Kafka cluster to Kafka 0.9 - https://phabricator.wikimedia.org/T121562#2285494 (Nuria)
[15:45:46] Analytics-Kanban, Patch-For-Review: Make webrequest load and refine jobs a single bundle - https://phabricator.wikimedia.org/T130731#2285508 (Nuria) Open>Resolved
[15:46:38] Analytics-Cluster, Analytics-Kanban, Operations, Patch-For-Review: Spark yarn in client mode is never moved from ACCEPTED to RUNNING - https://phabricator.wikimedia.org/T134422#2265192 (Nuria) @ottomatta: can this be closed? there is an open question from @MoritzMuehlenhoff
[15:46:53] Analytics-Cluster, Analytics-Kanban: Standardize use of refinery_path over oozie_path in all refinery oozie property files - https://phabricator.wikimedia.org/T133206#2285516 (Nuria) Open>Resolved
[15:46:58] Analytics: Pageview API: Limit (and document) size of data you can request - https://phabricator.wikimedia.org/T134524#2268100 (JAllemandou) @Gwicke: We plan to limit the volume of data requested but not the way to request it, meaning we'll probably keep start and end date. If you're interested I have made a...
[15:47:04] Analytics-Tech-community-metrics, Developer-Relations, Community-Tech-Sprint: Investigation: Can we find a new search API for CorenSearchBot and Copyvio Detector tool? - https://phabricator.wikimedia.org/T125459#2285519 (coren) @kaldari I've also been taking a look at how valuable the various search...
[15:47:07] Analytics-Kanban, Patch-For-Review: Count jam.wikipedia pageviews - https://phabricator.wikimedia.org/T134279#2285520 (Nuria) Open>Resolved
[15:47:20] Analytics-Kanban: Standardise naming in oozie jobs (particularly for top level ones) - https://phabricator.wikimedia.org/T130732#2285521 (Nuria) Open>Resolved
[15:47:34] Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: EventLogging dies when fetching a schema over HTTP that does not exist. {oryx} - https://phabricator.wikimedia.org/T124799#2285522 (Nuria) Open>Resolved
[15:47:49] Analytics-Cluster, Analytics-Kanban, Operations, Patch-For-Review: Spark yarn in client mode is never moved from ACCEPTED to RUNNING - https://phabricator.wikimedia.org/T134422#2285527 (Nuria) Open>Resolved
[15:47:59] Analytics-Cluster, Analytics-Kanban, Operations, Patch-For-Review: Spark yarn in client mode is never moved from ACCEPTED to RUNNING - https://phabricator.wikimedia.org/T134422#2265192 (Nuria) Resolved>Open
[15:49:34] Analytics-Cluster, Analytics-Kanban: Publicize stat1004 - https://phabricator.wikimedia.org/T133056#2285532 (Nuria) Open>Resolved
[15:50:42] (CR) Nuria: "I think commit message should list bugs and repro steps so we can properly test fixes." [analytics/dashiki] - https://gerrit.wikimedia.org/r/288104 (https://phabricator.wikimedia.org/T122533) (owner: Mforns)
[15:52:17] Analytics: Pageview API: Limit (and document) size of data you can request - https://phabricator.wikimedia.org/T134524#2285540 (GWicke) @JAllemandou, are there significant cache hit rates for per-article requests at all right now? Every time I looked into timeouts, it was for > 1 month worth of per-article d...
[15:52:40] ottomata: did you just do some magic with EL? :D
[15:55:42] elukey: ha, i just backfilled stuff from labs, ja :)
[15:57:32] Analytics-Tech-community-metrics, Developer-Relations, Community-Tech-Sprint: Investigation: Can we find a new search API for CorenSearchBot and Copyvio Detector tool? - https://phabricator.wikimedia.org/T125459#2285542 (Ricordisamoa) >>! In T125459#2278845, @Earwig wrote: > @Ricordisamoa As a servic...
[16:03:40] a-team - new york, can you hear us?
[16:03:49] ottomata: there is something going on with your connection
[16:03:53] ottomata: no video
[16:03:59] ottomata: we cannot mute you
[16:04:28] ottomata: do you want to rejoin?
[16:06:08] mforns: andrew's laptop is having some trouble
[16:06:21] madhuvishy, ok
[16:11:29] Analytics, Analytics-Wikistats, Internet-Archive: Total page view numbers on Wikistats do not match new page view definition - https://phabricator.wikimedia.org/T126579#2285566 (ezachte) So my post from two days ago was a false lead. The issue of the rounding error is real, but also really small. Mor...
[16:42:20] Analytics, Continuous-Integration-Config: Add a maven-release user to Gerrit {hawk} - https://phabricator.wikimedia.org/T132176#2285719 (madhuvishy)
[16:45:43] Analytics, Continuous-Integration-Config: Add a maven-release user to Gerrit {hawk} - https://phabricator.wikimedia.org/T132176#2190712 (madhuvishy) Can someone from releng help with this? Pinging @demon - It was mentioned to me sometime that he usually handles similar requests. This task involves making...
[16:48:22] milimetric: while Andrew is away, do you know how long it takes for the copying of the data to prod to happen? following up on what Andrew did to move the data to MySQL.
[17:02:55] Analytics, Continuous-Integration-Config: Add a maven-release user to Gerrit {hawk} - https://phabricator.wikimedia.org/T132176#2285850 (madhuvishy) From chatting on irc, It looks like I can make it myself. So i will! Thanks y'all!
[17:04:31] going offline, byyeee!
[17:05:33] bye elukey! good night :)
[17:26:00] Analytics, DBA, Patch-For-Review: Reportupdater does not commit changes after each query - https://phabricator.wikimedia.org/T134950#2285979 (mforns) Hi @Neil_P._Quinn_WMF Thanks for spotting this, and for the thorough description. > However, that connection had been holding open a transaction for...
[17:38:14] ah elukey I forgot to reenable camus! :o
[17:38:15] just did
[17:39:01] ottomata: sorry, didn't check for jobs flowing after restart
[17:50:32] Analytics, DBA, Patch-For-Review: Reportupdater does not commit changes after each query - https://phabricator.wikimedia.org/T134950#2286109 (jcrespo) > not sure if they still use innodb_history_list_length Even if no InnoDB is touched, if a transaction is opened and not closed regularly, InnoDB has...
[17:59:28] Analytics, DBA, Patch-For-Review: Reportupdater does not commit changes after each query - https://phabricator.wikimedia.org/T134950#2286146 (Neil_P._Quinn_WMF) Thanks @mforns and @jcrespo! Also, mySQLdb doesn't seem to be maintained anymore ([last commit in January 2014](https://github.com/farcepes...
[18:01:44] (CR) Neil P. Quinn-WMF: "Nuria: yes, it does seem that there's an effect even with read-only queries. I put the details in T134950." [analytics/reportupdater] - https://gerrit.wikimedia.org/r/288133 (https://phabricator.wikimedia.org/T134950) (owner: Neil P. Quinn-WMF)
[18:02:45] (CR) Neil P. Quinn-WMF: "(The query described in the description is read-only.)" [analytics/reportupdater] - https://gerrit.wikimedia.org/r/288133 (https://phabricator.wikimedia.org/T134950) (owner: Neil P. Quinn-WMF)
[19:13:11] Analytics, DBA, Editing-Analysis, Patch-For-Review: Reportupdater does not commit changes after each query - https://phabricator.wikimedia.org/T134950#2286403 (Neil_P._Quinn_WMF)
[19:25:24] Analytics-Tech-community-metrics, Developer-Relations, Community-Tech-Sprint: Investigation: Can we find a new search API for CorenSearchBot and Copyvio Detector tool? - https://phabricator.wikimedia.org/T125459#2207471 (Crow) @kaldari @Earwig As a consumer of both tools, I can say there have been ti...
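The pattern behind T134950 above, ending the transaction after every query instead of leaving it open on a long-lived connection, might be sketched like this. It is not reportupdater's code: `run_query` is a hypothetical helper, and the standard-library `sqlite3` module stands in for the MySQL driver only so that the example runs on its own. The `cursor`, `commit` and `rollback` calls are the generic Python DB-API ones.

```python
import sqlite3


def run_query(connection, sql, params=()):
    """Run one query, fetch its rows, and end the transaction either way."""
    cursor = connection.cursor()
    try:
        cursor.execute(sql, params)
        rows = cursor.fetchall()
        # Commit even after a read-only query so no transaction stays open.
        connection.commit()
        return rows
    except Exception:
        connection.rollback()
        raise
    finally:
        cursor.close()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    run_query(conn, "CREATE TABLE report (day TEXT, value INTEGER)")
    run_query(conn, "INSERT INTO report VALUES (?, ?)", ("2016-05-11", 42))
    print(run_query(conn, "SELECT day, value FROM report"))
    print(conn.in_transaction)
    conn.close()
```

Note that `sqlite3` does not open a transaction for a plain `SELECT`, so the stand-in only illustrates the shape of the fix; whether and when a read-only query holds a transaction open depends on the driver and server settings in use.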
[20:19:20] (PS1) Milimetric: [WIP] Support YYYYMMDDHH format for the unique endpoint [analytics/aqs] - https://gerrit.wikimedia.org/r/288264 (https://phabricator.wikimedia.org/T134840)
[20:27:06] Analytics-Tech-community-metrics, Developer-Relations, Community-Tech-Sprint: Investigation: Can we find a new search API for CorenSearchBot and Copyvio Detector tool? - https://phabricator.wikimedia.org/T125459#2286693 (Compassionate727) @Crow @kaldari @Earwig I can agree with that.
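For the WIP patch above, accepting a `YYYYMMDDHH` timestamp could be sketched as below. This is a guess at the general shape rather than the contents of the Gerrit change: AQS itself is not written in Python, the function name `parse_timestamp` is hypothetical, and accepting the shorter `YYYYMMDD` form alongside it is an assumption.

```python
from datetime import datetime

# Map accepted string lengths to their strptime formats.
_FORMATS = {8: "%Y%m%d", 10: "%Y%m%d%H"}


def parse_timestamp(value):
    """Parse a YYYYMMDD or YYYYMMDDHH string into a datetime.

    Raises ValueError for anything that is not all digits, has an
    unexpected length, or names an impossible date or hour.
    """
    if not value.isdigit() or len(value) not in _FORMATS:
        raise ValueError("expected YYYYMMDD or YYYYMMDDHH, got %r" % value)
    return datetime.strptime(value, _FORMATS[len(value)])


if __name__ == "__main__":
    print(parse_timestamp("20160511"))
    print(parse_timestamp("2016051100"))
```

Checking the length before calling `strptime` avoids ambiguous parses, since `strptime` alone will accept one-digit months, days and hours.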