[01:03:11] (CR) Milimetric: [V: 2 C: 2] "You definitely got it basically right with your first patch, the rest is just formatting so it fits our idiosyncrasies. Ping us when you'" [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/415798 (https://phabricator.wikimedia.org/T152222) (owner: Cicalese)
[01:09:59] Analytics-Tech-community-metrics, Developer-Relations (Jan-Mar-2018): Investigate listing the "Onboarding New Developers" KPIs on a custom dashboard - https://phabricator.wikimedia.org/T179329#4025656 (Aklapper)
[01:15:56] Analytics-Tech-community-metrics, Developer-Relations (Jan-Mar-2018): Investigate listing the "Onboarding New Developers" KPIs on a custom dashboard - https://phabricator.wikimedia.org/T179329#4025687 (Aklapper) Open>Resolved Creation worked now: `Saved Dashboard as "C_KPIs"` (`C_` prefix because...
[01:23:44] Analytics-Kanban: English Wikivoyage traffic spike possible bot - https://phabricator.wikimedia.org/T187244#4025698 (kaldari) Open>Resolved @Tbayer: Nice sleuthing! BTW, now that the WikiVoyage Edit-a-thon is over, the pageviews have gone back to normal.
[02:15:27] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Move statistics::discovery jobs from stat1002 -> stat1005 - https://phabricator.wikimedia.org/T170471#4025790 (mpopov)
[04:01:43] PROBLEM - statsv Varnishkafka log producer on cp5010 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:03:43] RECOVERY - statsv Varnishkafka log producer on cp5010 is OK: PROCS OK: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf
[07:05:27] ah nice! Singapore vks! :) --^
[08:44:46] Analytics-Kanban, User-Elukey: Reboot all Analytics hosts for Kernel upgrade - https://phabricator.wikimedia.org/T188594#4026220 (elukey)
[08:45:12] elukey: nice hearing from them too!
[08:45:37] oh yes!
[08:47:27] ema, elukey: We haz hits from *.eqsin.wmnet in webrequest :)
[08:48:00] happy times
[08:48:04] joal: morning!
[08:48:09] all good if I reboot archiva?
[08:48:19] hi elukey :) Good for me :)
[08:48:24] super
[08:53:51] Analytics-Tech-community-metrics, Developer-Relations (Jan-Mar-2018): For new authors on C_Gerrit_Demo, provide a way to access the list of Gerrit patches of each new author - https://phabricator.wikimedia.org/T187895#4026231 (Aklapper) Open>Resolved "For new authors on C_Gerrit_Demo, provide a w...
[08:54:37] archiva up and running
[09:00:08] \o/
[09:10:41] * ema joins the celebrations by going for a coffee
[09:41:23] !log stop eventlogging's mysql consumers for db1107 (el master) kernel updates
[09:41:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:08:43] !log re-starting mysql consumers on eventlog1001
[10:08:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:13:06] eventlogging master db rebooted and mariadb upgraded by Manuel, all goood
[10:17:34] elukey: Good job mate !!
[10:19:15] !log restart webrequest-load-wf-upload-2018-3-6-7 (failed due to reboots)
[10:19:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:19:39] joal: I didn't do anything! Saint Manuel from Madrid is the one that we should thank :)
[10:30:38] Let's do that then elukey :)
[10:31:48] elukey: have you rebooted some analytics host this morning?
[10:31:54] joal: yep
[10:32:01] Ok :)
[10:32:07] elukey: done for now?
[10:32:46] elukey: I feel like this spark job I have always runs when we need to reboot machines :)
[10:33:04] nope, I am about to reboot an106[8,9] :(
[10:33:08] did I kill your job?
[10:33:21] elukey: you didn't kill it, it failed for wrong reasons
[10:33:39] elukey: you've not rebooted an1069 yet?
[10:33:44] nope
[10:33:49] just stopped yarn
[10:33:53] Ahhh :)
[10:33:56] That explains :)
[10:34:11] I always wait for all the jvms to complete, then I stop hdfs and finally reboot
[10:34:50] elukey: my job uses a yarn-child daemon (spark-shuffle service), and therefore stopping yarn makes my job fail
[10:35:10] elukey: that's interesting however - it means the job is super-sensitive to failures (which I already knew)
[10:35:41] elukey: I'll wait for the reboots to be done before restarting
[10:36:44] joal: I am wondering if there is a different/better strategy to prepare nodes for shutdown
[10:37:16] hm elukey - I actually don't know
[10:37:44] something like sending a drain msg to the yarn node manager to not be available for more work except the jvms running
[10:37:53] The only concern I have seen so far is when heavy spark jobs are running
[10:38:44] I need to find a better or more automated way to reboot nodes, it takes too much time
[10:38:59] hm
[10:45:50] he team :]
[10:45:57] *hey
[10:46:48] joal, do you have 10 mins to save me from scala hell?
[10:46:58] Hi mforns
[10:47:03] mforns: To the cave!
[10:47:06] ok!
[10:52:31] rebooted all worker nodes except
[10:52:58] 28/35/52 (journal nodes) - 62 (down due to dimm issues)
[10:53:26] so now I can reboot the first three one at a time
[11:02:06] elukey: good for me
[11:04:21] just checked the hdfs ui, the quorum is ok
[11:04:25] so I can start with 28
[11:07:12] then after lunch I'll do an1001/2
[11:07:16] so the cluster will be done
[11:07:35] after that, druid's turn :P
[11:27:32] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-m4-master-00 consumer/mysql-eventbus
[11:27:51] this is me --^
[11:39:54] joal: I have good news! (I am very sarcastic now :P)
[11:41:58] so one step that was missed during the db upgrade was that it failed over to db1108 (analytics-slave)
[11:42:16] and it didn't roll back to db1107 because it needs a manual action (wasn't aware of it)
[11:42:29] soooo the mysql consumers have been writing to the slave
[11:43:41] so now the situation is
[11:47:22] Hi elukey
[11:47:57] elukey: I'm assuming we'll have to make db1107 read from 1108, then swap back?
[11:48:39] yes this is the idea that I have too
[11:51:45] elukey: and in the mean time we'll have to stop EL consumers I think (to prevent adding more rows to )
[11:51:54] already did yes
[11:52:08] the issue might be trickier though, I am talking with Jaime now
[11:52:30] ok elukey
[11:52:42] elukey: where do we stand on hadoop reboots?
[11:52:47] done with journal nodes?
[11:52:58] only 35/52 left
[11:53:56] I stopped after 28 due to the EL issue
[11:54:57] ok elukey - I'm afraid to restart my job before you reboot them :)
[11:55:23] joal: please do since I am afraid my afternoon will be dedicated to el :(
[11:55:32] elukey: ack !
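
[editor's note: elukey is asking above for a way to drain a worker before reboot. A minimal sketch of the stock YARN decommission flow, assuming the exclude file configured by yarn.resourcemanager.nodes.exclude-path; the graceful variant (-g, which waits for running containers) only exists in Hadoop 2.8+ (YARN-914), and on older releases -refreshNodes kills running containers immediately:]

    # On the ResourceManager host (file path and hostname are assumptions):
    echo "analytics1068.eqiad.wmnet" >> /etc/hadoop/conf/yarn.hosts.exclude
    yarn rmadmin -refreshNodes -g 3600 -server  # Hadoop 2.8+: drain for up to 1h
    yarn node -list -all                        # watch the node leave RUNNING
    # ...then stop the NodeManager/DataNode on the worker and reboot it.
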
[11:55:33] I'll reboot them tomorrow
[12:01:06] Thanks elukey - Please let me know if I can help with EL (even if it's not my main area of strength)
[12:01:36] the main issue seems to be this one
[12:01:45] 1) db1108 replication was stopped
[12:01:58] 2) el kept inserting to db1107 incrementing ids
[12:02:08] 3) el stopped, no more inserts
[12:02:16] 4) db1107 stopped and rebooted
[12:02:39] 5) el restarted, inserts to db1108
[12:02:51] 6) mistake found, el stopped
[12:02:54] Oh waow ... That's not cool
[12:03:02] hm
[12:03:31] the main issue is that now both hosts have incremented some ids that the other one doesn't have
[12:03:40] so data is not lost but in a weird state
[12:04:18] elukey: manual scripts to gather differences (using timestamps) between 7 and 8, and reimport from 7 to 8 only needed values
[12:05:20] elukey: :(
[12:05:54] joal: or something like - manual script to get differences and import them from 8 to 7, then delete on 8 that data and leave the replication to do the rest
[12:06:53] elukey: if we're happy with 7 being master, yes, would do
[12:07:33] yes 7 is the master
[12:07:49] and 8 usually the slave :D
[12:07:56] today it got promoted :D
[12:08:12] huhu
[12:08:31] I don't understand how we ended up inserting into 8 though
[12:08:54] so we have a dbproxy for m4-master.eqiad.wmnet (that we don't manage)
[12:09:00] (we == analytics)
[12:09:20] once db1107 was stopped, it failed over to db1107
[12:09:22] Ohhh, when 7 rebooted (stage 4), 8 got promoted
[12:09:23] err db1108
[12:09:35] right
[12:09:53] the issue was that 07 was not restored as m4-master
[12:10:41] makes sense elukey
[12:10:49] I guess that we need more people to understand what to do
[12:12:43] elukey: I think writing a script moving all data inserted onto 8 to 7, then delete it from 8, would be the correct thing to do
[12:17:53] elukey: Wow, something else I just discovered - I have eventlogcleaner-cron error emails in my spam box !!!
[12:46:11] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4026711 (elukey) p:Triage>High
[12:47:26] Analytics, EventBus, MediaWiki-JobQueue, Services (doing): Migrate RefreshLinks job to kafka - https://phabricator.wikimedia.org/T185052#4026731 (mobrovac)
[12:53:25] taking a break a-team
[12:56:41] all right alerted analytics@
[12:56:51] tracking task is https://phabricator.wikimedia.org/T188991
[12:57:59] going to grab a quick lunch!
[13:01:24] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4026817 (elukey)
[14:03:41] I am back
[14:07:55] hiiii
[14:09:00] hello ottomata!
[14:09:25] as you may have already seen in the emails there is an interesting issue for eventlogging :(
[14:09:33] i haven't! just trying to go through emails
[14:09:43] but need to prep for webrequest text switchover
[14:09:45] we still ok for that?
[14:10:11] not sure, I'd prefer to fix the EL issue with you if possible
[14:10:23] we scheduled this time with jeff green
[14:10:24] is all
[14:10:33] ah ok didn't know that
[14:10:37] https://phabricator.wikimedia.org/T185136#4014394
[14:10:42] sure then
[14:10:47] k
[14:11:03] looking at email
[14:12:04] elukey: since this is custom repl
[14:12:11] couldn't you just run the replication scripts backwards for a bit?
[14:12:16] slave -> master?
[14:12:35] it compares based on latest id/timestamp
[14:12:44] the main issue is that there should be data pushed to db1107 that was not yet consumed by the replication
[14:12:47] so as long as you haven't inserted newer data already in master
[14:12:51] oh
[14:12:52] yeah
[14:12:52] hm
[14:13:07] I didn't check all the tables, I wanted to chat with you first
[14:13:24] I can check the discrepancies now
[14:13:28] you can move consumer groups somehow...you'd have to do it manually with some code, but that would work ok, i think
[14:13:35] you'd have to find the proper offsets
[14:13:39] for each partition
[14:13:57] hm
[14:14:44] it was a misunderstanding between me and Manuel, I thought that the dbproxy was set to master and he thought that I would have done it (and also that db1108 was read only)
[14:14:48] so we ended up in this
[14:15:01] aye
[14:15:23] but in theory it can happen anytime if there is a failover
[14:15:44] it auto fails over?
[14:15:52] yep
[14:15:55] maybe we should not use the proxy name in the mysql writer
[14:18:44] Analytics, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136#4027077 (Ottomata)
[14:18:51] going to check max(timestamp,id) on all the el tables
[14:18:56] k
[14:20:01] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4027078 (Ottomata) > We run a proxy in front of the eventlogging database, called m4-master If we can't write to the failover, then we probab...
[14:20:26] elukey: i'm waiting for jeff now anyway, i'll look into that offset resetting thing
[14:27:13] hi ottomata
[14:27:16] hayyyy
[14:27:17] ok
[14:27:21] https://phabricator.wikimedia.org/T185136
[14:27:25] https://gerrit.wikimedia.org/r/#/c/416683/
[14:27:30] if i merge and apply ^
[14:27:49] no more messages will go to analytics eqiad cluster, but all to jumbo-eqiad
[14:27:58] then we let kafkatee finish consuming from the old cluster....
[14:28:10] then flip the host list, and whack the state file
[14:28:19] and that should be it, right?
[14:28:46] yup
[14:29:21] joal: yt? https://gerrit.wikimedia.org/r/#/c/415636/
[14:29:27] ok, I think you can merge anytime, I'll puppetize the hostlist change
[14:29:29] i'm gonna merge that now
[14:29:34] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4027191 (Marostegui) >>! In T188991#4027078, @Ottomata wrote: >> We run a proxy in front of the eventlogging database, called m4-master > > I...
[14:29:35] ok, Jeff_Green got a couple of prep steps, will letcha know
[14:29:39] ok
[14:34:20] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4027223 (elukey) >>! In T188991#4027191, @Marostegui wrote: >>>! In T188991#4027078, @Ottomata wrote: >>> We run a proxy in front of the event...
[14:35:42] elukey: ok, getting ready to do this, ya?
[14:35:49] i'm going to stop puppet on text vks
[14:36:16] ottomata: ack, everything seems good
[14:36:27] alerted the traffic team
[14:36:35] danke
[14:39:37] ok Jeff_Green, elukey I'm merging. puppet is disabled. will run on a single host, test, and then batch run puppet on remaining cache text hosts over a few minutes
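
[editor's note: back to the EL offsets for a moment. The "find the proper offsets for each partition and commit them" idea ottomata sketches above is what his el_commit.py gist (linked later at 17:04) implements; this is only an illustrative kafka-python equivalent, not that script. The group id is an assumption; the topic and the ~09:35 UTC cutoff come from the discussion:]

    from kafka import KafkaConsumer, TopicPartition
    from kafka.structs import OffsetAndMetadata

    # Connect as the (stopped!) mysql consumer group, with auto-commit off.
    consumer = KafkaConsumer(
        bootstrap_servers='kafka1012:9092',
        group_id='eventlogging_consumer_mysql_00',  # assumed group name
        enable_auto_commit=False,
    )

    topic = 'eventlogging-valid-mixed'
    tps = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]

    # First offset per partition whose message timestamp is >= 2018-03-06T09:35:00Z.
    cutoff_ms = 1520328900 * 1000
    found = consumer.offsets_for_times({tp: cutoff_ms for tp in tps})

    # Commit those offsets so the group re-consumes the failover window;
    # re-consumed duplicates are handled on insert.
    consumer.commit({tp: OffsetAndMetadata(o.offset, None)
                     for tp, o in found.items() if o is not None})
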
[14:39:43] ok
[14:40:33] ack
[14:46:11] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4027272 (Marostegui) >>! In T188991#4027223, @elukey wrote: >>>! In T188991#4027191, @Marostegui wrote: >>>>! In T188991#4027078, @Ottomata wr...
[14:46:56] ottomata: kafka-consumer-groups on 0.11 seems to have some options to play with offsets
[14:47:12] ok, cumin batch running, text should be fully migrated in 10ish minutes
[14:47:18] super
[14:47:36] there is kafkatee on oxygen to monitor/switch right?
[14:47:43] ah no wait we have two instances
[14:47:47] you already got it
[14:48:26] yup
[14:48:37] there is the banner impression streaming job to restart
[14:48:50] i've already made a puppet change so that the stream_check thing will use jumbo
[14:48:58] saw it yes, already merged?
[14:49:00] ya
[14:49:12] once done i guess we can just kill the job and watch the check restart it?
[14:49:21] so it is only a matter of killing the job, it will be restarted
[14:49:25] yeah
[14:50:43] elukey: what's the earliest timestamp of an el mysql event that you could use as a place to restart the consumers from?
[14:50:50] i'll try and figure out how we can re-set the offsets
[14:52:28] ottomata: this is a good question, not sure
[14:52:32] I built https://meta.wikimedia.org/wiki/Config:Dashiki:Pingback and tested it locally as best as I could without data being available yet. It appears to work well. Any chance somebody could kick off reportupdater to generate the pingback data?
[14:54:04] from the SAL I've stopped the mysql consumers (the first time) around 9:40ish
[14:54:09] UTC
[14:54:41] but I am checking last timestamps on db1107 now
[14:55:14] most of them around that hour, a couple of minutes before
[14:55:25] now the main issue though is the id auto-increment no?
[14:55:31] ok, i mean, a little bit before is fine, since dups will be handled
[14:55:33] I'm on it, CindyCicaleseWMF, I had forgotten one little piece last night or else the reportupdater job would've already been running
[14:55:37] it's going to be imprecise anyway
[14:55:56] ya, i'm working on offsets for consumer to restart for master
[14:56:04] something else will need to be done to delete from slave
[14:56:21] yes this is the issue
[14:57:20] milimetric, thanks! Eagerly awaiting the pretty graphs :-)
[14:57:53] ottomata: https://grafana.wikimedia.org/dashboard/db/kafka-by-topic-prometheus?refresh=5m&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kafka_jumbo&var-kafka_broker=All&var-topic=webrequest_text
[14:57:57] wow :)
[14:58:37] :)
[14:58:59] all right lemme kill banner impression then
[15:00:37] elukey: wait
[15:00:41] puppet not done yet
[15:00:42] still going
[15:01:13] it is fine, it will be restarted in 5 min IIRC by the cron, but I need to update hiera right?
[15:01:22] (already killed it sorry)
[15:01:35] elukey: it will start consuming from jumbo, but there will still be new messages in analytics
[15:01:52] i guess it doesn't matter because the batch job replaces data, right?
[15:02:16] elukey: hiera already updated
[15:02:20] super
[15:02:22] ottomata: whenever you have a sec, I need a merge on a ReportUpdater job: https://gerrit.wikimedia.org/r/#/c/416698/
[15:02:38] thx much
[15:02:47] done
[15:04:27] hm, actually, puppet is almost done on all, why only 1.5K/s?
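
[editor's note: the 0.11 tooling ottomata mentions at 14:46 is the --reset-offsets option added to kafka-consumer-groups (KIP-122), which can do the same timestamp-based rewind from the CLI. A hedged example, with the group name assumed as in the earlier sketch; the consumers must be stopped first, and without --execute the tool only prints the plan:]

    kafka-consumer-groups.sh --bootstrap-server kafka1012:9092 \
      --group eventlogging_consumer_mysql_00 \
      --topic eventlogging-valid-mixed \
      --reset-offsets --to-datetime 2018-03-06T09:35:00.000 \
      --execute   # the datetime carries no timezone; mind UTC vs local time
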
[15:04:28] hm [15:04:45] OH [15:04:45] hahah [15:04:51] elukey: i'm running puppet on uploads [15:04:52] DOH [15:05:10] oook, running puppet on texts now.... [15:05:28] ok, it'll be another 10 mins then... [15:06:06] ottomata: ahahah I was about to ask, wasn't seeing the sequence numbers dropping to zero.. [15:06:12] (on vks) [15:06:14] aye [15:10:55] ottomata: I might have an idea for the el db mess [15:12:13] that might work [15:12:57] whenever you are ok I can brainbounce it with you [15:12:58] ya? [15:13:09] bc? [15:14:06] k [15:20:26] (03PS19) 10Mforns: Add EL and whitelist sanitization [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) [15:20:56] ok, CindyCicaleseWMF, the dashboard config is merged into dashiki, the job is configured, and the dashboard is deployed: https://pingback.wmflabs.org/#media-wiki-version [15:21:07] the only thing remaining is for the data to get created [15:21:39] so at some point today-ish you'll start seeing data and it'll keep updating until it's caught up [15:21:43] then it'll run weekly [15:21:44] milimetric: great!! [15:21:53] thanks for all of your help!! [15:22:45] (03CR) 10jerkins-bot: [V: 04-1] Add EL and whitelist sanitization [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: 10Mforns) [15:24:22] no problem at all, dashboards are easy. Collecting the right data to answer questions is hard [15:29:25] Jeff_Green: puppet done [15:29:34] no text in kafka analytics anymore [15:29:39] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=13&fullscreen&orgId=1&from=now-1h&to=now&var-instance=webrequest [15:29:41] do the kafkatee thing [15:29:42] confirmed [15:29:43] all good [15:29:46] \o/ [15:29:47] ok, logs seem to have stopped collecting [15:31:19] Hey ottomata / elukey [15:31:31] Looks like the move to jumbo worked like a charm ! [15:32:50] it seems so :) [15:33:03] vk is performing well, tls rtt stable and no errors registered [15:33:09] \o/ ! [15:33:28] ottomata: Thanks for the path on streaming cron ! [15:33:36] ottomata: I did see i, but did not +1 :( [15:33:46] seems to be working! [15:35:18] elukey/ ottomata - This however looks weird [15:35:19] https://yarn.wikimedia.org/proxy/application_1518549639452_78067/streaming/ [15:35:25] great! [15:35:46] joal: maybe job needs restated? [15:35:47] elukey: ? [15:36:19] ottomata: I have an idea [15:36:33] i betcha it got restarted with old params [15:36:36] can we kill it? [15:36:47] ottomata: nope, confirmed it got restarted with good params [15:36:51] oh [15:36:52] ? [15:37:02] ottomata: however spark-kafka uses zookeeper !!!! [15:37:05] !!! [15:37:16] yeah, but that should be ok...as long as it odesn't do it on its own terms [15:37:24] Arf [15:37:27] Maybe not then [15:37:28] the kaf ka zoookeper commit should do it in a different zookeeper chroot [15:37:43] but how does it use zookeeper? [15:37:50] we don't provide it the zk connect string [15:37:51] do we ? 
[15:38:22] ottomata: I'm reviewing the options, and it seems we use ZK only for druid
[15:38:29] plus a worker cannot connect to zk
[15:38:37] it is firewalled
[15:39:16] ottomata: I think I have an idea: since we've restarted the job with checkpointed state, I think it read the config from the checkpoint
[15:39:40] let me kill and manually restart without checkpoint, then kill again for automagic restart
[15:39:53] Analytics-Kanban, Operations, ops-eqiad: DIMM errors for analytics1062 - https://phabricator.wikimedia.org/T187164#4027495 (Cmjohnson) I swapped the B side DIMM to the A side to see if the error returns and follows the DIMM. Powered server on, let's check back in a day or so.
[15:40:49] hm
[15:40:50] oook...
[15:43:47] Jeff_Green: great, easy peasy then, we good, right?
[15:43:55] as far as I can tell, yes
[15:44:09] greaat, thanks!
[15:44:15] thank you!
[15:46:24] ottomata: The chart of data flowing into jumbo is super awesome :)
[15:47:10] elukey: is anyone helping you with data recovery?
[15:50:07] milimetric: i'm helping
[15:50:34] k, sweet
[15:52:07] Analytics-Kanban, Operations, ops-eqiad: DIMM errors for analytics1062 - https://phabricator.wikimedia.org/T187164#4027595 (elukey) Thanks!
[15:52:08] in the meantime, an1062 is back
[15:56:19] will be 5 min late to standup, sorry a-team
[16:00:32] (PS20) Mforns: Add EL and whitelist sanitization [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064)
[16:00:49] trying to join...
[16:01:25] ping milimetric
[16:02:42] (CR) jerkins-bot: [V: -1] Add EL and whitelist sanitization [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: Mforns)
[16:03:45] PROBLEM - Number of banner_activity realtime events received by Druid over a 30 minutes period on einsteinium is CRITICAL: CRITICAL - druid_realtime_banner_activity is 0 https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&panelId=41&fullscreen&orgId=1
[16:05:45] RECOVERY - Number of banner_activity realtime events received by Druid over a 30 minutes period on einsteinium is OK: OK - druid_realtime_banner_activity is 375 https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&panelId=41&fullscreen&orgId=1
[16:05:57] ottomata: --^
[16:06:03] ottomata: We're back on track
[16:06:22] ottomata: I'll kill my manual restart to let the automatic one do its job
[16:07:39] ottomata: Learning point here - Checkpointed streaming jobs don't update their conf
[16:08:16] hmmm k
[16:08:19] thanks joal
[16:09:26] ottomata: it makes sense - Like, part of the saved state are kafka offsets, so when changing kafka brokers, doesn't make sense to keep them etc
[16:09:43] (PS21) Mforns: Add EL and whitelist sanitization [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064)
[16:12:39] (CR) jerkins-bot: [V: -1] Add EL and whitelist sanitization [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: Mforns)
[16:19:32] (CR) Ottomata: Add EL and whitelist sanitization (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: Mforns)
[16:22:42] (CR) Mforns: Add EL and whitelist sanitization (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: Mforns)
[16:23:39] (PS22) Mforns: Add EL and whitelist sanitization [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064)
[16:32:39] Analytics-Cluster, Analytics-Kanban, Operations, ops-eqiad, and 2 others: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4027752 (Ottomata) Bump, what's the status on these?
[16:48:36] Analytics-Cluster, Analytics-Kanban, Operations, ops-eqiad, and 2 others: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4027808 (Cmjohnson) @ottomata: these need installs if you have the spare cycles feel free. On-site work is done
[16:58:02] Analytics-Kanban, Patch-For-Review: Spark 2.2.1 as cluster default (working with oozie) - https://phabricator.wikimedia.org/T159962#4027831 (Ottomata) Oof, joal, ya, copied spark-assembly instead of oozie-sharelib. Fixed now.
[17:03:31] elukey: i got a script working for recommitting offsets i think
[17:04:21] https://gist.github.com/ottomata/d69ba72313c44e8e45e6453f4ea97074
[17:04:22] lunching...
[17:07:17] super
[17:07:26] I am going through the tables to purge
[17:07:50] so on the master ~90ish have 1) a timestamp field 2) that is 20180306etc..
[17:08:42] I am assuming that a recent table (that gets new inserts) must have the timestamp field (not sure about the eventbus ones, still need to check)
[17:10:02] so in the end it could be as easy as delete from %table where timestamp > 20180306093000
[17:10:21] (on the slave)
[17:10:37] ottomata: what timestamp did you use for your offsets?
[17:11:25] elukey: about 9:35
[17:11:48] oh eventbus ones...right oof
[17:11:51] i only did valid mixed
[17:11:52] hm
[17:17:13] milimetric: does Amir know you have productionised his data?
[17:18:37] milimetric: I ask since I have noticed some queries lately that look like the usual unproductionised cross-navigation ones
[17:20:12] ottomata: there is also something that I don't understand.. I tried to do max(timestamp) on all the tables on db1108, and the max one is 20180306101617 only on a few of them
[17:20:18] I expected more
[17:21:20] I mean, I've restarted mysql consumers around 10:08
[17:21:34] and max timestamp is only 10:16?
[17:27:16] Analytics, EventBus, MediaWiki-General-or-Unknown, Multi-Content-Revisions, Services (watching): It should be possible to understand the reason of revision creation from RevisionRecordInserted hook - https://phabricator.wikimedia.org/T188396#4027964 (Addshore)
[17:28:14] ok maybe I am crazy
[17:28:33] so, if I go to eventlog1001's m4-consumer logs I can see this as last entry
[17:28:36] 2018-03-06 11:24:38,168 [24623] (MainThread) Inserted 15 UniversalLanguageSelector_7327441 events in 0.009351 seconds
[17:28:41] good
[17:28:53] now on the EL slave
[17:28:54] elukey@db1108:~$ sudo mysql --skip-ssl <<< 'select max(timestamp) from log.UniversalLanguageSelector_7327441'
[17:28:57] max(timestamp)
[17:29:00] 20180306095650
[17:29:23] so I assume that EL, due to processing time etc, did not process a ton of data in those two hours
[17:29:54] ottomata: --^
[17:29:58] does it make sense?
[17:30:06] I am just sanity checking what's in the db
[17:30:30] if so, I can generate the list of delete statements easy
[17:30:41] 1) grab the list of tables with timestamp field
[17:30:59] 2) delete from table where timestamp > 09:30
[17:31:29] (sorry am half afk as i make lunch...)
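
[editor's note: the behavior joal calls out ("Checkpointed streaming jobs don't update their conf") follows from Spark's StreamingContext.getOrCreate pattern: when a checkpoint exists, the whole context (conf, DStream graph, Kafka offsets) is deserialized from it and the setup function is skipped. A generic Scala sketch, not the actual banner-impressions job; names and paths are illustrative:]

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/banner_streaming/checkpoint" // assumed path

    def createContext(): StreamingContext = {
      // Runs only on a cold start: a new broker list set here is ignored
      // whenever a checkpoint already exists.
      val conf = new SparkConf().setAppName("BannerImpressionsStream")
      val ssc = new StreamingContext(conf, Seconds(60))
      ssc.checkpoint(checkpointDir)
      // ... build the Kafka DStream with the current broker list here ...
      ssc
    }

    // Restores conf, graph and offsets from the checkpoint if present, hence
    // the fix above: wipe the checkpoint to pick up new configuration.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
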
hm
[17:32:21] ah yes okok :)
[17:32:23] elukey: hm not sure
[17:32:39] is that true of all tables?
[17:32:47] they have about a 9:56 as latest timestamp?
[17:32:56] we can check a few dts for messages at currently committed offset in kafka
[17:33:49] elukey: partition 0 for eventlogging-valid-mixed mysql consumer group is at offset 760189269
[17:33:57] kafkacat -b kafka1012:9092 -C -t eventlogging-valid-mixed -p 0 -o 760189269 -c 1 | jq .dt
[17:33:57] "2018-03-06T11:24:38"
[17:34:20] that's MediaViewer
[17:34:32] MediaViewer_10867062
[17:34:36] what's the latest timestamp in that table?
[17:34:54] elukey@db1108:~$ sudo mysql --skip-ssl <<< 'select max(timestamp) from log.MediaViewer_10867062'
[17:34:57] max(timestamp)
[17:35:00] 20180306093949
[17:36:29] Gone for dinner, back after
[17:36:49] joal: he may be trying to compare to vet the new data? But yeah, he definitely knows
[17:37:09] ottomata: how did you find 760189269 ?
[17:37:17] Amir1: did you still want to work on the dashboard, I forgot you pinged last week
[17:37:41] elukey: https://gist.github.com/ottomata/d69ba72313c44e8e45e6453f4ea97074
[17:37:47] https://gist.github.com/ottomata/d69ba72313c44e8e45e6453f4ea97074#file-el_commit-py-L55
[17:38:05] http://kafka-python.readthedocs.io/en/master/apidoc/KafkaConsumer.html#kafka.KafkaConsumer.committed
[17:41:46] now I am wondering if eventlogging_sync has somehow played a role in this mess
[17:46:11] nope it doesn't make sense
[17:46:20] last entry with rows on db1108 is
[17:46:21] Mar 6 09:37:32 db1108 eventlogging_sync.sh[5051]: 2018-03-06T09:37:32 localhost log MediaViewer_10867062 (rows!) ok
[17:46:53] then (nothing) ok
[17:48:22] elukey: just to be sure, what's the latest timestamp for MediaViewer_10867062 on db1107?
[17:49:10] elukey@db1107:~$ sudo mysql --skip-ssl <<< 'select max(timestamp) from log.MediaViewer_10867062'
[17:49:13] max(timestamp)
[17:49:15] 20180306093821
[17:49:48] hm
[17:50:00] I am not sure if I am missing something trivial
[17:50:05] strange that the committed offset for the consumer is later
[17:50:08] something is def not right
[17:50:14] yeah
[17:50:17] i mean, i guess it shouldn't matter, as we'll reset the offsets anyway
[17:50:24] and reconsume stuf
[17:50:25] stuff
[17:50:33] so as long as the slave is fixed so it will replicate from master properly
[17:50:35] we should be ok
[17:51:50] but I can't explain why the data is weird
[17:52:04] it should have inserted no?
[17:52:55] elukey: what is the latest timestamp in all tables on db1108?
[17:53:02] possible that the tables we are looking at just have very little data?
[17:53:15] hm, no, mediaviewer we know should have a later ts
[17:53:24] the latest seems to be 10:16
[17:53:58] MobileWikiAppOnboarding_9123466
[17:54:26] that is 09:35 on db1107
[17:54:37] v strange
[17:58:11] (PS1) Milimetric: Add pingback dashboard [analytics/dashiki] - https://gerrit.wikimedia.org/r/416732
[17:58:23] (CR) Milimetric: [V: 2 C: 2] Add pingback dashboard [analytics/dashiki] - https://gerrit.wikimedia.org/r/416732 (owner: Milimetric)
[18:03:43] Analytics, EventBus, MediaWiki-JobQueue, Services (done): Failed to acquire page lock in LinksUpdate - https://phabricator.wikimedia.org/T188106#4028144 (Pchelolo) Since the deployment of the fix, this has not happened again even though the rate of refreshLinks jobs is fairly high right now. I'll...
[18:05:25] (CR) Ottomata: Ground sqoop output to DEVNULL (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/415909 (owner: Milimetric)
[18:06:47] ottomata: whenever you have done with lunch etc.. can we have ~30 mins in bc to coordinate the el work?
[18:06:56] *are done
[18:07:33] (CR) Milimetric: "It's only no sqoop output. The script puts out plenty of output, and dry-run would never get any sqoop output anyway. The right way to f" [analytics/refinery] - https://gerrit.wikimedia.org/r/415909 (owner: Milimetric)
[18:09:08] (CR) Ottomata: [C: 1] "Oook!" [analytics/refinery] - https://gerrit.wikimedia.org/r/415909 (owner: Milimetric)
[18:09:21] elukey: sure lets go b
[18:09:22] bc
[18:09:34] ottomata: we can do later on if you want!
[18:09:41] naw i'm all lunched
[18:10:55] milimetric: hey, I don't remember pinging :D I'd love to help to get dashboard in a better design though
[18:11:12] Amir1: oops! wrong amir
[18:11:21] I guessed :D
[18:11:24] ok, Amir1, what are _you_ talking about :)
[18:11:42] mforns: if you want, I can show you around this data
[18:11:57] ok, batcave in 1 minute?
[18:12:02] milimetric, ^
[18:12:03] milimetric: mostly making the wikistats 2.0 align with wikimedia ui design guide
[18:12:13] mforns: bc-2 'cause 1 is taken, but I'll be there
[18:12:22] Analytics-Cluster, Analytics-Kanban, Operations, ops-eqiad, and 2 others: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4028162 (RobH) I'm working on getting these installed today.
[18:12:41] ok
[18:13:17] Amir1: yeah, Volker sent us some thoughts, we'll look at them sometime this summer, not focusing at that level on the UI yet
[18:13:48] milimetric: cool, let me know if I can help with that in the summer
[18:14:14] will do
[18:28:11] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 2 others: Set up grafana alerts for JobQueue-EventBus - https://phabricator.wikimedia.org/T189038#4028220 (Pchelolo) p:Triage>Normal
[18:52:55] Analytics, Collaboration-Team-Triage, EventBus, StructuredDiscussions: Does not appear edits in IRCfeed or Event Stream when non autopatrolled users added comment - https://phabricator.wikimedia.org/T187861#4028403 (Catrope)
[18:59:33] (CR) Joal: [C: 1] "I just reviewed the "mask" code - 1 comment inline about an idea, but sounds good to go as-is :)" (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: Mforns)
[19:06:21] !log cleaned up id=0 rows on db1108 (log database) for T188991
[19:06:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:06:24] T188991: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991
[19:07:32] ottomata: db1108 cleaned (the el tables part)
[19:08:30] cool ok
[19:08:44] elukey: needs it for the eventbus _2 tables too
[19:09:35] yep I am checking now
[19:10:11] ok, i'm ready to do commits when you are
[19:11:57] ottomata: done!
[19:13:05] ok elukey bc?
[19:14:49] sure
[19:27:30] Analytics: Mediawiki History: moves counted twice in Revision - https://phabricator.wikimedia.org/T189044#4028787 (Milimetric)
[19:32:25] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are running.
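
[editor's note: the shape of the slave-side cleanup elukey !logs above, combined with his earlier plan (grab the log tables that have a timestamp field, delete everything after the ~09:30 cutoff, let master->slave replication backfill). Illustrative SQL only; the literal timestamp comparison mirrors what was quoted in the channel, and the real cleanup also handled id=0 rows:]

    -- On db1108 (the slave): emit one DELETE per EventLogging table
    -- that has a `timestamp` column.
    SELECT CONCAT('DELETE FROM log.`', table_name,
                  '` WHERE timestamp > 20180306093000;')
    FROM information_schema.columns
    WHERE table_schema = 'log'
      AND column_name = 'timestamp';
    -- Review the generated statements, run them, then re-enable the sync.
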
[19:32:57] elukey: thanks for this work
[19:33:56] nuria_: Andrew is the master behind this recovery, I only executed some deletes :)
[19:34:37] so now we should be ok
[19:34:46] the consumers are happily inserting again
[19:34:54] going to write a summary and then update the analytics@ list
[19:35:08] elukey: ok
[19:40:07] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4028840 (elukey) It seems that we were (kind of) lucky. For some reason that we don't know (it predates most of us), the tables on the slave d...
[19:40:34] nuria_: --^
[19:41:19] Analytics, Collaboration-Team-Triage, EventBus, StructuredDiscussions: Does not appear edits in IRCfeed or Event Stream when non autopatrolled users added comment - https://phabricator.wikimedia.org/T187861#4028847 (jmatazzoni) Hi @Etonkovidova. We came across this in Triage meeting. The decision...
[19:41:27] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4028849 (Ottomata) For posterity, here's the script I used: https://gist.github.com/ottomata/d69ba72313c44e8e45e6453f4ea97074
[19:41:39] elukey: let's set the mysql hostname in puppet to db1107
[19:41:40] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4028858 (elukey)
[19:41:42] so this doesn't happen again
[19:42:27] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4026711 (elukey) Before closing this task: 1) review the m4-master failover policy. 2) document this procedure on wikitech
[19:42:39] ottomata|afk: I added some next steps to the task :)
[19:45:10] all right, I think I can go home now :D
[19:45:16] * elukey off!
[19:45:26] Bye elukey - Thanks for the work!!
[19:46:32] credits to the Kafka Master Andrew, I didn't do much :)
[19:57:19] Indeed - ottomata|afk-San!
[20:06:23] Analytics: Mediawiki History: moves counted twice in Revision - https://phabricator.wikimedia.org/T189044#4028997 (JAllemandou) Super interesting finding !!! Let's discuss what to do with that.
[20:09:40] Analytics: Mediawiki History: moves counted twice in Revision - https://phabricator.wikimedia.org/T189044#4028787 (Nuria) >and so the editor numbers for geowiki will be considerably lower than those from mediawiki_history and Wikistats 2. The number of edits might change but not the number of editors. Right?
[20:09:52] (CR) Mforns: Add EL and whitelist sanitization (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: Mforns)
[20:18:17] Analytics, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136#4029038 (Ottomata)
[20:30:37] Analytics, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136#4029086 (Ottomata)
[20:30:48] Analytics, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136#3907314 (Ottomata)
[20:35:16] !log pointing mediawiki monolog kafka producers at kafka jumbo-eqiad cluster: T188136
[20:35:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:35:32] T188136: Migrate Mediawiki Monolog Kafka producer to Kafka Jumbo - https://phabricator.wikimedia.org/T188136
[20:35:42] Analytics: Mediawiki History: moves counted twice in Revision - https://phabricator.wikimedia.org/T189044#4029093 (Milimetric) >>! In T189044#4029016, @Nuria wrote: >>and so the editor numbers for geowiki will be considerably lower than those from mediawiki_history and Wikistats 2. > > The number of edits m...
[20:41:42] Analytics-Kanban, Patch-For-Review: Include X-Client-IP in EventLogging data and geocode during Hive JSON Refinement - https://phabricator.wikimedia.org/T186833#4029101 (Ottomata)
[20:44:08] !log reverted change to point mediawiki monolog kafka producers at kafka jumbo-eqiad until deployment train is done T188136
[20:44:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:44:10] T188136: Migrate Mediawiki Monolog Kafka producer to Kafka Jumbo - https://phabricator.wikimedia.org/T188136
[20:44:11] milimetric: As far as I can tell, the pingback data still hasn't been created. Is that to be expected? I keep refreshing hoping to see it, but so far have been sad.
[20:45:39] CindyCicaleseWMF: I was just looking into it, the job failed with access denied to the db but you seem to have set it up properly
[20:45:58] trying to figure out what's up... it could be because of that outage today; the server has been offline and not accepting queries
[20:46:10] ah, ok - thanks for checking
[20:47:09] Oh, is it using the correct creds file?
[20:47:49] yeah, it's the right creds, I double checked that I can login using them... must be the outage
[20:48:03] it should try again in 15 minutes and I'll double check then
[20:48:48] ah! no, it didn't have access to the creds file, I'll fix it
[20:49:37] I only have access to one of the two creds files (that I know of) on stat1006. I was testing with research-client.cnf. I thought I had changed the patch at one point to use stats-research-client.cnf as was suggested.
[20:49:49] (PS1) Milimetric: Change cred file to the one stats has access to [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/416768
[20:50:01] (CR) Milimetric: [V: 2 C: 2] Change cred file to the one stats has access to [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/416768 (owner: Milimetric)
[20:50:25] Thanks! Sorry, I thought I had fixed the patch, but I guess not.
[20:50:52] no problem at all CindyCicaleseWMF, that was interesting, I didn't know we had two files with different access levels
[20:51:48] Ah, I had changed it in patch set 4, but it got changed back in patch set 5. I was testing with the other one, and forgot that when I later changed config.yaml.
[21:06:35] (PS6) Milimetric: [WIP] Compute geowiki statistics for Druid from cu_changes data [analytics/refinery] - https://gerrit.wikimedia.org/r/413265
[21:07:13] I gotta run, doing my pSAT tutoring tonight, see yall tomorrow
[21:47:46] milimetric: has it tried to run again yet? I still don't see any data appearing.
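
[editor's note: the root cause milimetric finds above is a credentials file the job's user couldn't read. A quick hedged way to reproduce that kind of failure; the two .cnf file names come from the discussion, but the directory and the analytics-slave host are assumptions:]

    # Run as the user the reportupdater cron runs under:
    mysql --defaults-file=/etc/mysql/conf.d/stats-research-client.cnf \
          -h analytics-slave.eqiad.wmnet -e 'SELECT 1;'
    # An unreadable defaults file or an "Access denied" error here
    # reproduces the failure the job hit.
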
[22:15:41] Analytics-Cluster, Analytics-Kanban, Operations, ops-eqiad, and 2 others: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4029237 (RobH) Ok, All systems are installed with stretch. I need to reinstall analytics1076, as it had the wrong hostname set by a bad rever...
[22:16:30] Analytics-Cluster, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4029238 (RobH)
[22:16:51] Analytics-Cluster, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4002756 (RobH) ping @elukey: You can take over on ALL but analytics1076. I need to keep working on it for now.
[22:17:01] Analytics-Kanban, Patch-For-Review: Remove sensitive fields from whitelist for QuickSurvey schemas (end of Q2) - https://phabricator.wikimedia.org/T174386#4029241 (leila) @mforns we're ready to proceed with this task. Can you, for archive happiness, call out which fields will be dropped after the purge?...
[22:17:14] elukey: heyas, if you are about, https://phabricator.wikimedia.org/T188294 is nearly done
[22:17:30] analytics107[0-7] ready to go, except for 1076
[22:21:06] Analytics: Add trash folder to hadoop - https://phabricator.wikimedia.org/T189051#4029254 (Nuria)
[22:25:49] Analytics, Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Bring back raw user_agent in EventLogging data so we can do further processing in Hadoop - https://phabricator.wikimedia.org/T188673#4029276 (Ottomata) Ah! I was totally wrong! EventLogging's parsed userAgent already has `is...
[22:41:28] Yay!! The pingback data was just produced, and the graphs are all working! They need a bit of tweaking, but this was a huge step forward. Thank you all for your help!
[23:09:20] Analytics-Kanban, Patch-For-Review: Remove sensitive fields from whitelist for QuickSurvey schemas (end of Q2) - https://phabricator.wikimedia.org/T174386#3560006 (Nuria) Ping @fdans to continue with task looks like @leila needs to review the whitelist changes and fields that will be kept
[23:16:44] CindyCicaleseWMF: nice, let me look
[23:31:46] nuria_: I have several questions - possibly enhancement requests. Shall I ask you here?
[23:31:55] CindyCicaleseWMF: sure
[23:33:04] Great! The numbers being represented in the graphs and tables are all counts. Is there a way to make them show up as integers (no trailing .0)?
[23:34:21] And, what is the small box in the bottom left of the graphs for? I tried typing in different numbers, and it appears to make the graphs smoother, which in this case is not really desired behavior. Is there a way to suppress it? Or is it doing something important that I do not understand?
[23:35:22] CindyCicaleseWMF: it is smoothing for points, in this case there are so few points it is not of use. I do not think it is suppressable (?) that seems ticket worthy as we can make it so
[23:35:53] CindyCicaleseWMF: re: integers, i think so
[23:36:12] I'd sort of like to see the actual counts for the numbers over 1,000 rather than XX.Xk - I think. But, as the numbers grow, that will not be good, so I think I can be talked out of that.
[23:36:28] OK, I'll make a ticket for suppressing the box.
[23:37:02] The last one, which is probably a ticket, too: Is there a way to make the header row of the table fixed so it doesn't scroll away as you scroll down the table to see the lower data rows?
[23:39:24] CindyCicaleseWMF: you have the formatter set to kmb
[23:39:27] right?
[23:39:30] CindyCicaleseWMF: https://meta.wikimedia.org/wiki/Config:Dashiki:Pingback
[23:39:32] yes
[23:39:43] CindyCicaleseWMF: so kmb is kilobytes, megabytes
[23:39:44] I originally omitted format, but everything appeared as a percent.
[23:40:09] as far as I could tell, kmb and percent were the only options, and the default appears to be percent
[23:40:17] CindyCicaleseWMF: ah i see, % must be default, that can be changed, one sec
[23:40:46] should I remove the kmb?
[23:43:30] CindyCicaleseWMF: look at this one now: https://pingback.wmflabs.org/#media-wiki-version/media-wiki-version-timeseries
[23:43:41] CindyCicaleseWMF: is this what you mean (y-axis)
[23:44:13] Yes! Exactly!
[23:44:30] ok, see my change on your config
[23:45:03] CindyCicaleseWMF: numeral is used for formatting http://numeraljs.com/
[23:45:14] Cool! It would be great if that were documented :-) Unless I missed it.
[23:45:56] I'll fix the rest of the config.
[23:48:40] Nice! All fixed! Both reportupdater and dashiki are really great tools! I'm amazed at how quickly I was able to get these graphs up!
[23:49:19] CindyCicaleseWMF: updated: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Dashiki#Configuring_the_tabs_layout
[23:50:11] Excellent!
[23:50:48] So, I'll submit a ticket for suppressing the box. Should I submit one for making the table header fixed when the table scrolls as well?
[23:50:57] CindyCicaleseWMF: maybe you can send e-mail to pm lists and spread knowledge, not all data in EL is suitable to be graphed on dashiki but this type of "reporting" data is well suited for it
[23:51:03] CindyCicaleseWMF: sure, yes
[23:51:13] CindyCicaleseWMF: i cannot help you there cause i know no css
[23:52:15] OK, I'll submit the tickets. I'd be happy to email about this. Which lists would you suggest emailing? I'm still learning all of the different communication channels . . .
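
[editor's note: the format values in the dashboard config map to numeral.js patterns (library linked above). A few illustrative calls; the exact pattern dashiki applies for "kmb" is not shown in the log, but numeral's abbreviation format behaves like this:]

    var numeral = require('numeral');

    numeral(12345).format('0,0');   // "12,345" - plain integer counts
    numeral(12345).format('0.0a');  // "12.3k"  - abbreviated, the "XX.Xk" style above
    numeral(0.731).format('0.0%');  // "73.1%"  - the percent default Cindy first saw
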