[01:03:11] (CR) Milimetric: [V: 2 C: 2] "You definitely got it basically right with your first patch, the rest is just formatting so it fits our idiosyncrasies. Ping us when you'" [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/415798 (https://phabricator.wikimedia.org/T152222) (owner: Cicalese)
[01:09:59] Analytics-Tech-community-metrics, Developer-Relations (Jan-Mar-2018): Investigate listing the "Onboarding New Developers" KPIs on a custom dashboard - https://phabricator.wikimedia.org/T179329#4025656 (Aklapper)
[01:15:56] Analytics-Tech-community-metrics, Developer-Relations (Jan-Mar-2018): Investigate listing the "Onboarding New Developers" KPIs on a custom dashboard - https://phabricator.wikimedia.org/T179329#4025687 (Aklapper) Open>Resolved Creation worked now: `Saved Dashboard as "C_KPIs"` (`C_` prefix because...
[01:23:44] Analytics-Kanban: English Wikivoyage traffic spike possible bot - https://phabricator.wikimedia.org/T187244#4025698 (kaldari) Open>Resolved @Tbayer: Nice sleuthing! BTW, now that the WikiVoyage Edit-a-thon is over, the pageviews have gone back to normal.
[02:15:27] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Move statistics::discovery jobs from stat1002 -> stat1005 - https://phabricator.wikimedia.org/T170471#4025790 (mpopov)
[04:01:43] PROBLEM - statsv Varnishkafka log producer on cp5010 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[04:03:43] RECOVERY - statsv Varnishkafka log producer on cp5010 is OK: PROCS OK: 0 processes with args /usr/bin/varnishkafka -S /etc/varnishkafka/statsv.conf
[07:05:27] ah nice! Singapore vks! :) --^
[08:44:46] Analytics-Kanban, User-Elukey: Reboot all Analytics hosts for Kernel upgrade - https://phabricator.wikimedia.org/T188594#4026220 (elukey)
[08:45:12] elukey: nice hearing from them too!
[08:45:37] oh yes!
[08:47:27] ema, elukey: We haz hits from *.eqsin.wmnet in webrequest :)
[08:48:00] happy times
[08:48:04] joal: morning!
[08:48:09] all good if I reboot archiva?
[08:48:19] hi elukey :) Good for me :)
[08:48:24] super
[08:53:51] Analytics-Tech-community-metrics, Developer-Relations (Jan-Mar-2018): For new authors on C_Gerrit_Demo, provide a way to access the list of Gerrit patches of each new author - https://phabricator.wikimedia.org/T187895#4026231 (Aklapper) Open>Resolved "For new authors on C_Gerrit_Demo, provide a w...
[08:54:37] archiva up and running
[09:00:08] \o/
[09:10:41] * ema joins the celebrations by going for a coffee
[09:41:23] !log stop eventlogging's mysql consumers for db1107 (el master) kernel updates
[09:41:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:08:43] !log re-starting mysql consumers on eventlog1001
[10:08:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:13:06] eventlogging master db rebooted and mariadb upgraded by Manuel, all goood
[10:17:34] elukey: Good job mate !!
[10:19:15] !log restart webrequest-load-wf-upload-2018-3-6-7 (failed due to reboots)
[10:19:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:19:39] joal: I didn't do anything! Saint Manuel from Madrid is the one that we should thank :)
[10:30:38] Let's do that then elukey :)
[10:31:48] elukey: have you rebooted some analytics host this morning?
[10:31:54] joal: yep
[10:32:01] Ok :)
[10:32:07] elukey: done for now?
[10:32:46] elukey: I feel like this spark job I have always runs when we need to reboot machines :)
[10:33:04] nope, I am about to reboot an106[8,9] :(
[10:33:08] did I kill your job?
[10:33:21] elukey: you didn't kill it, it failed for wrong reasons
[10:33:39] elukey: you've not rebooted an1069 yet?
[10:33:44] nope
[10:33:49] just stopped yarn
[10:33:53] Ahhh :)
[10:33:56] That explains :)
[10:34:11] I always wait for all the jvms to complete, then I stop hdfs and finally reboot
[10:34:50] elukey: my job uses a yarn-child daemon (spark-shuffle service), and therefore stopping yarn makes my job fail
[10:35:10] elukey: that's interesting however - it means the job is super-sensitive to failures (which I already knew)
[10:35:41] elukey: I'll wait for the reboots to be done before restarting
[10:36:44] joal: I am wondering if there is a different/better strategy to prepare nodes for shutdown
[10:37:16] hm elukey - I actually don't know
[10:37:44] something like sending a drain msg to the yarn node manager to not be available for more work except the jvms running
[10:37:53] The only concern I have seen so far is when heavy spark jobs are running
[10:38:44] I need to find a better or more automated way to reboot nodes, it takes too much time
[10:38:59] hm
[10:45:50] he team :]
[10:45:57] *hey
[10:46:48] joal, do you have 10 mins to save me from scala hell?
[10:46:58] Hi mforns
[10:47:03] mforns: To the cave!
[10:47:06] ok!
[10:52:31] rebooted all worker nodes except
[10:52:58] 28/35/52 (journal nodes) - 62 (down due to dimm issues)
[10:53:26] so now I can reboot the first three one at a time
[11:02:06] elukey: good for me
[11:04:21] just checked the hdfs ui, the quorum is ok
[11:04:25] so I can start with 28
[11:07:12] then after lunch I'll do an1001/2
[11:07:16] so the cluster will be done
[11:07:35] after that, druid's turn :P
[11:27:32] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: consumer/mysql-m4-master-00 consumer/mysql-eventbus
[11:27:51] this is me --^
[11:39:54] joal: I have good news! (I am very sarcastic now :P)
[11:41:58] so one step that was missed during the db upgrade was that it failed over to db1108 (analytics-slave)
[11:42:16] and it didn't roll back to db1107 because it needs a manual action (wasn't aware of it)
[11:42:29] soooo the mysql consumers have been writing to the slave
[11:43:41] so now the situation is
[11:47:22] Hi elukey
[11:47:57] elukey: I'm assuming we'll have to make db1107 read from 1108, then swap back?
[11:48:39] yes this is the idea that I have too
[11:51:45] elukey: and in the mean time we'll have to stop EL consumers I think (to prevent adding more rows to )
[11:51:54] already did yes
[11:52:08] the issue might be trickier though, I am talking with Jaime now
[11:52:30] ok elukey
[11:52:42] elukey: where do we stand on hadoop reboots?
[11:52:47] done with journal nodes?
[11:52:58] only 35/52 left
[11:53:56] I stopped after 28 due to the EL issue
[11:54:57] ok elukey - I'm afraid to restart my job before you reboot them :)
[11:55:23] joal: please do since I am afraid my afternoon will be dedicated to el :(
[11:55:32] elukey: ack !
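
[editor's note: elukey is asking above for a way to drain a worker before reboot. A minimal sketch of the stock YARN decommission flow, assuming the exclude file configured by yarn.resourcemanager.nodes.exclude-path; the graceful variant (-g, which waits for running containers) only exists in Hadoop 2.8+ (YARN-914), and on older releases -refreshNodes kills running containers immediately:]

    # On the ResourceManager host (file path and hostname are assumptions):
    echo "analytics1068.eqiad.wmnet" >> /etc/hadoop/conf/yarn.hosts.exclude
    yarn rmadmin -refreshNodes -g 3600 -server  # Hadoop 2.8+: drain for up to 1h
    yarn node -list -all                        # watch the node leave RUNNING
    # ...then stop the NodeManager/DataNode on the worker and reboot it.
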
[11:55:33] I'll reboot them tomorrow
[12:01:06] Thanks elukey - Please let me know if I can help with EL (even if it's not my main area of strength)
[12:01:36] the main issue seems to be this one
[12:01:45] 1) db1108 replication was stopped
[12:01:58] 2) el kept inserting to db1107 incrementing ids
[12:02:08] 3) el stopped, no more inserts
[12:02:16] 4) db1107 stopped and rebooted
[12:02:39] 5) el restarted, inserts to db1108
[12:02:51] 6) mistake found, el stopped
[12:02:54] Oh waow ... That's not cool
[12:03:02] hm
[12:03:31] the main issue is that now both hosts have incremented some ids that the other one doesn't have
[12:03:40] so data is not lost but in a weird state
[12:04:18] elukey: manual scripts to gather differences (using timestamps) between 7 and 8, and reimport from 7 to 8 only needed values
[12:05:20] elukey: :(
[12:05:54] joal: or something like - manual script to get differences and import them from 8 to 7, then delete on 8 that data and leave the replication to do the rest
[12:06:53] elukey: if we're happy with 7 being master, yes, would do
[12:07:33] yes 7 is the master
[12:07:49] and 8 usually the slave :D
[12:07:56] today it got promoted :D
[12:08:12] huhu
[12:08:31] I don't understand how we ended up inserting into 8 though
[12:08:54] so we have a dbproxy for m4-master.eqiad.wmnet (that we don't manage)
[12:09:00] (we == analytics)
[12:09:20] once db1107 was stopped, it failed over to db1107
[12:09:22] Ohhh, when 7 rebooted (stage 4), 8 got promoted
[12:09:23] err db1108
[12:09:35] right
[12:09:53] the issue was that 07 was not restored as m4-master
[12:10:41] makes sense elukey
[12:10:49] I guess that we need more people to understand what to do
[12:12:43] elukey: I think writing a script moving all data inserted onto 8 to 7, then delete it from 8, would be the correct thing to do
[12:17:53] elukey: Wow, something else I just discovered - I have eventlogcleaner-cron error emails in my spam box !!!
[12:46:11] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4026711 (elukey) p:Triage>High
[12:47:26] Analytics, EventBus, MediaWiki-JobQueue, Services (doing): Migrate RefreshLinks job to kafka - https://phabricator.wikimedia.org/T185052#4026731 (mobrovac)
[12:53:25] taking a break a-team
[12:56:41] all right alerted analytics@
[12:56:51] tracking task is https://phabricator.wikimedia.org/T188991
[12:57:59] going to grab a quick lunch!
[13:01:24] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4026817 (elukey)
[14:03:41] I am back
[14:07:55] hiiii
[14:09:00] hello ottomata!
[14:09:25] as you may have already seen in the emails there is an interesting issue for eventlogging :(
[14:09:33] i haven't! just trying to go through emails
[14:09:43] but need to prep for webrequest text switchover
[14:09:45] we still ok for that?
[14:10:11] not sure, I'd prefer to fix the EL issue with you if possible
[14:10:23] we scheduled this time with jeff green
[14:10:24] is all
[14:10:33] ah ok didn't know that
[14:10:37] https://phabricator.wikimedia.org/T185136#4014394
[14:10:42] sure then
[14:10:47] k
[14:11:03] looking at email
[14:12:04] elukey: since this is custom repl
[14:12:11] couldn't you just run the replication scripts backwards for a bit?
[14:12:16] slave -> master?
[14:12:35] it compares based on latest id/timestamp
[14:12:44] the main issue is that there should be data pushed to db1107 that was not yet consumed by the replication
[14:12:47] so as long as you haven't inserted newer data already in master
[14:12:51] oh
[14:12:52] yeah
[14:12:52] hm
[14:13:07] I didn't check all the tables, I wanted to chat with you first
[14:13:24] I can check the discrepancies now
[14:13:28] you can move consumer groups somehow...you'd have to do it manually with some code, but that would work ok, i think
[14:13:35] you'd have to find the proper offsets
[14:13:39] for each partition
[14:13:57] hm
[14:14:44] it was a misunderstanding between me and Manuel, I thought that the dbproxy was set to master and he thought that I would have done it (and also that db1108 was read only)
[14:14:48] so we ended up in this
[14:15:01] aye
[14:15:23] but in theory it can happen anytime if there is a failover
[14:15:44] it auto fails over?
[14:15:52] yep
[14:15:55] maybe we should not use the proxy name in the mysql writer
[14:18:44] Analytics, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136#4027077 (Ottomata)
[14:18:51] going to check max(timestamp,id) on all the el tables
[14:18:56] k
[14:20:01] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4027078 (Ottomata) > We run a proxy in front of the eventlogging database, called m4-master If we can't write to the failover, then we probab...
[14:20:26] elukey: i'm waiting for jeff now anyway, i'll look into that offset resetting thing
[14:27:13] hi ottomata
[14:27:16] hayyyy
[14:27:17] ok
[14:27:21] https://phabricator.wikimedia.org/T185136
[14:27:25] https://gerrit.wikimedia.org/r/#/c/416683/
[14:27:30] if i merge and apply ^
[14:27:49] no more messages will go to analytics eqiad cluster, but all to jumbo-eqiad
[14:27:58] then we let kafkatee finish consuming from the old cluster....
[14:28:10] then flip the host list, and whack the state file
[14:28:19] and that should be it, right?
[14:28:46] yup
[14:29:21] joal: yt? https://gerrit.wikimedia.org/r/#/c/415636/
[14:29:27] ok, I think you can merge anytime, I'll puppetize the hostlist change
[14:29:29] i'm gonna merge that now
[14:29:34] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4027191 (Marostegui) >>! In T188991#4027078, @Ottomata wrote: >> We run a proxy in front of the eventlogging database, called m4-master > > I...
[14:29:35] ok, Jeff_Green got a couple of prep steps, will letcha know
[14:29:39] ok
[14:34:20] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4027223 (elukey) >>! In T188991#4027191, @Marostegui wrote: >>>! In T188991#4027078, @Ottomata wrote: >>> We run a proxy in front of the event...
[14:35:42] elukey: ok, getting ready to do this, ya?
[14:35:49] i'm going to stop puppet on text vks
[14:36:16] ottomata: ack, everything seems good
[14:36:27] alerted the traffic team
[14:36:35] danke
[14:39:37] ok Jeff_Green, elukey I'm merging. puppet is disabled. will run on a single host, test, and then batch run puppet on remaining cache text hosts over a few minutes
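
[editor's note: back to the EL offsets for a moment. The "find the proper offsets for each partition and commit them" idea ottomata sketches above is what his el_commit.py gist (linked later at 17:04) implements; this is only an illustrative kafka-python equivalent, not that script. The group id is an assumption; the topic and the ~09:35 UTC cutoff come from the discussion:]

    from kafka import KafkaConsumer, TopicPartition
    from kafka.structs import OffsetAndMetadata

    # Connect as the (stopped!) mysql consumer group, with auto-commit off.
    consumer = KafkaConsumer(
        bootstrap_servers='kafka1012:9092',
        group_id='eventlogging_consumer_mysql_00',  # assumed group name
        enable_auto_commit=False,
    )

    topic = 'eventlogging-valid-mixed'
    tps = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]

    # First offset per partition whose message timestamp is >= 2018-03-06T09:35:00Z.
    cutoff_ms = 1520328900 * 1000
    found = consumer.offsets_for_times({tp: cutoff_ms for tp in tps})

    # Commit those offsets so the group re-consumes the failover window;
    # re-consumed duplicates are handled on insert.
    consumer.commit({tp: OffsetAndMetadata(o.offset, None)
                     for tp, o in found.items() if o is not None})
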
[14:39:43] ok
[14:40:33] ack
[14:46:11] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4027272 (Marostegui) >>! In T188991#4027223, @elukey wrote: >>>! In T188991#4027191, @Marostegui wrote: >>>>! In T188991#4027078, @Ottomata wr...
[14:46:56] ottomata: kafka-consumer-groups on 0.11 seems to have some options to play with offsets
[14:47:12] ok, cumin batch running, text should be fully migrated in 10ish minutes
[14:47:18] super
[14:47:36] there is kafkatee on oxygen to monitor/switch right?
[14:47:43] ah no wait we have two instances
[14:47:47] you already got it
[14:48:26] yup
[14:48:37] there is the banner impression streaming job to restart
[14:48:50] i've already made a puppet change so that the stream_check thing will use jumbo
[14:48:58] saw it yes, already merged?
[14:49:00] ya
[14:49:12] once done i guess we can just kill the job and watch the check restart it?
[14:49:21] so it is only a matter of killing the job, it will be restarted
[14:49:25] yeah
[14:50:43] elukey: what's the earliest timestamp of an el mysql event that you could use as a place to restart the consumers from?
[14:50:50] i'll try and figure out how we can re-set the offsets
[14:52:28] ottomata: this is a good question, not sure
[14:52:32] I built https://meta.wikimedia.org/wiki/Config:Dashiki:Pingback and tested it locally as best as I could without data being available yet. It appears to work well. Any chance somebody could kick off reportupdater to generate the pingback data?
[14:54:04] from the SAL I've stopped the mysql consumers (the first time) around 9:40ish
[14:54:09] UTC
[14:54:41] but I am checking last timestamps on db1107 now
[14:55:14] most of them around that hour, a couple of minutes before
[14:55:25] now the main issue though is the id auto-increment no?
[14:55:31] ok, i mean, a little bit before is fine, since dups will be handled
[14:55:33] I'm on it, CindyCicaleseWMF, I had forgotten one little piece last night or else the reportupdater job would've already been running
[14:55:37] it's going to be imprecise anyway
[14:55:56] ya, i'm working on offsets for consumer to restart for master
[14:56:04] something else will need to be done to delete from slave
[14:56:21] yes this is the issue
[14:57:20] milimetric, thanks! Eagerly awaiting the pretty graphs :-)
[14:57:53] ottomata: https://grafana.wikimedia.org/dashboard/db/kafka-by-topic-prometheus?refresh=5m&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kafka_jumbo&var-kafka_broker=All&var-topic=webrequest_text
[14:57:57] wow :)
[14:58:37] :)
[14:58:59] all right lemme kill banner impression then
[15:00:37] elukey: wait
[15:00:41] puppet not done yet
[15:00:42] still going
[15:01:13] it is fine, it will be restarted in 5 min IIRC by the cron, but I need to update hiera right?
[15:01:22] (already killed it sorry)
[15:01:35] elukey: it will start consuming from jumbo, but there will still be new messages in analytics
[15:01:52] i guess it doesn't matter because the batch job replaces data, right?
[15:02:16] elukey: hiera already updated
[15:02:20] super
[15:02:22] ottomata: whenever you have a sec, I need a merge on a ReportUpdater job: https://gerrit.wikimedia.org/r/#/c/416698/
[15:02:38] thx much
[15:02:47] done
[15:04:27] hm, actually, puppet is almost done on all, why only 1.5K/s?
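
[editor's note: the 0.11 tooling ottomata mentions at 14:46 is the --reset-offsets option added to kafka-consumer-groups (KIP-122), which can do the same timestamp-based rewind from the CLI. A hedged example, with the group name assumed as in the earlier sketch; the consumers must be stopped first, and without --execute the tool only prints the plan:]

    kafka-consumer-groups.sh --bootstrap-server kafka1012:9092 \
      --group eventlogging_consumer_mysql_00 \
      --topic eventlogging-valid-mixed \
      --reset-offsets --to-datetime 2018-03-06T09:35:00.000 \
      --execute   # the datetime carries no timezone; mind UTC vs local time
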
[15:04:28] hm [15:04:45] OH [15:04:45] hahah [15:04:51] elukey: i'm running puppet on uploads [15:04:52] DOH [15:05:10] oook, running puppet on texts now.... [15:05:28] ok, it'll be another 10 mins then... [15:06:06] ottomata: ahahah I was about to ask, wasn't seeing the sequence numbers dropping to zero.. [15:06:12] (on vks) [15:06:14] aye [15:10:55] ottomata: I might have an idea for the el db mess [15:12:13] that might work [15:12:57] whenever you are ok I can brainbounce it with you [15:12:58] ya? [15:13:09] bc? [15:14:06] k [15:20:26] (03PS19) 10Mforns: Add EL and whitelist sanitization [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) [15:20:56] ok, CindyCicaleseWMF, the dashboard config is merged into dashiki, the job is configured, and the dashboard is deployed: https://pingback.wmflabs.org/#media-wiki-version [15:21:07] the only thing remaining is for the data to get created [15:21:39] so at some point today-ish you'll start seeing data and it'll keep updating until it's caught up [15:21:43] then it'll run weekly [15:21:44] milimetric: great!! [15:21:53] thanks for all of your help!! [15:22:45] (03CR) 10jerkins-bot: [V: 04-1] Add EL and whitelist sanitization [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: 10Mforns) [15:24:22] no problem at all, dashboards are easy. Collecting the right data to answer questions is hard [15:29:25] Jeff_Green: puppet done [15:29:34] no text in kafka analytics anymore [15:29:39] https://grafana.wikimedia.org/dashboard/db/varnishkafka?panelId=13&fullscreen&orgId=1&from=now-1h&to=now&var-instance=webrequest [15:29:41] do the kafkatee thing [15:29:42] confirmed [15:29:43] all good [15:29:46] \o/ [15:29:47] ok, logs seem to have stopped collecting [15:31:19] Hey ottomata / elukey [15:31:31] Looks like the move to jumbo worked like a charm ! [15:32:50] it seems so :) [15:33:03] vk is performing well, tls rtt stable and no errors registered [15:33:09] \o/ ! [15:33:28] ottomata: Thanks for the path on streaming cron ! [15:33:36] ottomata: I did see i, but did not +1 :( [15:33:46] seems to be working! [15:35:18] elukey/ ottomata - This however looks weird [15:35:19] https://yarn.wikimedia.org/proxy/application_1518549639452_78067/streaming/ [15:35:25] great! [15:35:46] joal: maybe job needs restated? [15:35:47] elukey: ? [15:36:19] ottomata: I have an idea [15:36:33] i betcha it got restarted with old params [15:36:36] can we kill it? [15:36:47] ottomata: nope, confirmed it got restarted with good params [15:36:51] oh [15:36:52] ? [15:37:02] ottomata: however spark-kafka uses zookeeper !!!! [15:37:05] !!! [15:37:16] yeah, but that should be ok...as long as it odesn't do it on its own terms [15:37:24] Arf [15:37:27] Maybe not then [15:37:28] the kaf ka zoookeper commit should do it in a different zookeeper chroot [15:37:43] but how does it use zookeeper? [15:37:50] we don't provide it the zk connect string [15:37:51] do we ? 
[15:38:22] ottomata: I'm reviewing the options, and it seems we use ZK only for druid
[15:38:29] plus a worker cannot connect to zk
[15:38:37] it is firewalled
[15:39:16] ottomata: I think I have an idea: since we've restarted the job with checkpointed state, I think it read the config from the checkpoint
[15:39:40] let me kill and manually restart without checkpoint, then kill again for automagic restart
[15:39:53] Analytics-Kanban, Operations, ops-eqiad: DIMM errors for analytics1062 - https://phabricator.wikimedia.org/T187164#4027495 (Cmjohnson) I swapped the B side DIMM to the A side to see if the error returns and follows the DIMM. Powered server on, let's check back in a day or so.
[15:40:49] hm
[15:40:50] oook...
[15:43:47] Jeff_Green: great, easy peasy then, we good, right?
[15:43:55] as far as I can tell, yes
[15:44:09] greaat, thanks!
[15:44:15] thank you!
[15:46:24] ottomata: The chart of data flowing into jumbo is super awesome :)
[15:47:10] elukey: is anyone helping you with data recovery?
[15:50:07] milimetric: i'm helping
[15:50:34] k, sweet
[15:52:07] Analytics-Kanban, Operations, ops-eqiad: DIMM errors for analytics1062 - https://phabricator.wikimedia.org/T187164#4027595 (elukey) Thanks!
[15:52:08] in the meantime, an1062 is back
[15:56:19] will be 5 min late to standup, sorry a-team
[16:00:32] (PS20) Mforns: Add EL and whitelist sanitization [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064)
[16:00:49] trying to join...
[16:01:25] ping milimetric
[16:02:42] (CR) jerkins-bot: [V: -1] Add EL and whitelist sanitization [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: Mforns)
[16:03:45] PROBLEM - Number of banner_activity realtime events received by Druid over a 30 minutes period on einsteinium is CRITICAL: CRITICAL - druid_realtime_banner_activity is 0 https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&panelId=41&fullscreen&orgId=1
[16:05:45] RECOVERY - Number of banner_activity realtime events received by Druid over a 30 minutes period on einsteinium is OK: OK - druid_realtime_banner_activity is 375 https://grafana.wikimedia.org/dashboard/db/prometheus-druid?refresh=1m&panelId=41&fullscreen&orgId=1
[16:05:57] ottomata: --^
[16:06:03] ottomata: We're back on track
[16:06:22] ottomata: I'll kill my manual restart to let the automatic one do its job
[16:07:39] ottomata: Learning point here - Checkpointed streaming jobs don't update their conf
[16:08:16] hmmm k
[16:08:19] thanks joal
[16:09:26] ottomata: it makes sense - Like, part of the saved state are kafka offsets, so when changing kafka brokers, doesn't make sense to keep them etc
[16:09:43] (PS21) Mforns: Add EL and whitelist sanitization [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064)
[16:12:39] (CR) jerkins-bot: [V: -1] Add EL and whitelist sanitization [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: Mforns)
[16:19:32] (CR) Ottomata: Add EL and whitelist sanitization (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: Mforns)
[16:22:42] (CR) Mforns: Add EL and whitelist sanitization (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: Mforns)
[16:23:39] (PS22) Mforns: Add EL and whitelist sanitization [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064)
[16:32:39] Analytics-Cluster, Analytics-Kanban, Operations, ops-eqiad, and 2 others: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4027752 (Ottomata) Bump, what's the status on these?
[16:48:36] Analytics-Cluster, Analytics-Kanban, Operations, ops-eqiad, and 2 others: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4027808 (Cmjohnson) @ottomata: these need installs if you have the spare cycles feel free. On-site work is done
[16:58:02] Analytics-Kanban, Patch-For-Review: Spark 2.2.1 as cluster default (working with oozie) - https://phabricator.wikimedia.org/T159962#4027831 (Ottomata) Oof, joal, ya, copied spark-assembly instead of oozie-sharelib. Fixed now.
[17:03:31] elukey: i got a script working for recommitting offsets i think
[17:04:21] https://gist.github.com/ottomata/d69ba72313c44e8e45e6453f4ea97074
[17:04:22] lunching...
[17:07:17] super
[17:07:26] I am going through the tables to purge
[17:07:50] so on the master ~90ish have 1) a timestamp field 2) that is 20180306etc..
[17:08:42] I am assuming that a recent table (that gets new inserts) must have the timestamp field (not sure about the eventbus ones, still need to check)
[17:10:02] so in the end it could be as easy as delete from %table where timestamp > 20180306093000
[17:10:21] (on the slave)
[17:10:37] ottomata: what timestamp did you use for your offsets?
[17:11:25] elukey: about 9:35
[17:11:48] oh eventbus ones...right oof
[17:11:51] i only did valid mixed
[17:11:52] hm
[17:17:13] milimetric: does Amir know you have productionised his data?
[17:18:37] milimetric: I ask since I have noticed some queries lately that look like the usual unproductionised cross-navigation ones
[17:20:12] ottomata: there is also something that I don't understand.. I tried to do max(timestamp) on all the tables on db1108, and the max one is 20180306101617 only on a few of them
[17:20:18] I expected more
[17:21:20] I mean, I've restarted mysql consumers around 10:08
[17:21:34] and max timestamp is only 10:16?
[17:27:16] Analytics, EventBus, MediaWiki-General-or-Unknown, Multi-Content-Revisions, Services (watching): It should be possible to understand the reason of revision creation from RevisionRecordInserted hook - https://phabricator.wikimedia.org/T188396#4027964 (Addshore)
[17:28:14] ok maybe I am crazy
[17:28:33] so, if I go to eventlog1001's m4-consumer logs I can see this as last entry
[17:28:36] 2018-03-06 11:24:38,168 [24623] (MainThread) Inserted 15 UniversalLanguageSelector_7327441 events in 0.009351 seconds
[17:28:41] good
[17:28:53] now on the EL slave
[17:28:54] elukey@db1108:~$ sudo mysql --skip-ssl <<< 'select max(timestamp) from log.UniversalLanguageSelector_7327441'
[17:28:57] max(timestamp)
[17:29:00] 20180306095650
[17:29:23] so I assume that EL, due to processing time etc, did not process a ton of data in those two hours
[17:29:54] ottomata: --^
[17:29:58] does it make sense?
[17:30:06] I am just sanity checking what's in the db
[17:30:30] if so, I can generate the list of delete statements easy
[17:30:41] 1) grab the list of tables with timestamp field
[17:30:59] 2) delete from table where timestamp > 09:30
[17:31:29] (sorry am half afk as i make lunch...)
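
[editor's note: the behavior joal calls out ("Checkpointed streaming jobs don't update their conf") follows from Spark's StreamingContext.getOrCreate pattern: when a checkpoint exists, the whole context (conf, DStream graph, Kafka offsets) is deserialized from it and the setup function is skipped. A generic Scala sketch, not the actual banner-impressions job; names and paths are illustrative:]

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/banner_streaming/checkpoint" // assumed path

    def createContext(): StreamingContext = {
      // Runs only on a cold start: a new broker list set here is ignored
      // whenever a checkpoint already exists.
      val conf = new SparkConf().setAppName("BannerImpressionsStream")
      val ssc = new StreamingContext(conf, Seconds(60))
      ssc.checkpoint(checkpointDir)
      // ... build the Kafka DStream with the current broker list here ...
      ssc
    }

    // Restores conf, graph and offsets from the checkpoint if present, hence
    // the fix above: wipe the checkpoint to pick up new configuration.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
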
hm
[17:32:21] ah yes okok :)
[17:32:23] elukey: hm not sure
[17:32:39] is that true of all tables?
[17:32:47] they have about a 9:56 as latest timestamp?
[17:32:56] we can check a few dts for messages at currently committed offset in kafka
[17:33:49] elukey: partition 0 for eventlogging-valid-mixed mysql consumer group is at offset 760189269
[17:33:57] kafkacat -b kafka1012:9092 -C -t eventlogging-valid-mixed -p 0 -o 760189269 -c 1 | jq .dt
[17:33:57] "2018-03-06T11:24:38"
[17:34:20] that's MediaViewer
[17:34:32] MediaViewer_10867062
[17:34:36] what's the latest timestamp in that table?
[17:34:54] elukey@db1108:~$ sudo mysql --skip-ssl <<< 'select max(timestamp) from log.MediaViewer_10867062'
[17:34:57] max(timestamp)
[17:35:00] 20180306093949
[17:36:29] Gone for dinner, back after
[17:36:49] joal: he may be trying to compare to vet the new data? But yeah, he definitely knows
[17:37:09] ottomata: how did you find 760189269 ?
[17:37:17] Amir1: did you still want to work on the dashboard, I forgot you pinged last week
[17:37:41] elukey: https://gist.github.com/ottomata/d69ba72313c44e8e45e6453f4ea97074
[17:37:47] https://gist.github.com/ottomata/d69ba72313c44e8e45e6453f4ea97074#file-el_commit-py-L55
[17:38:05] http://kafka-python.readthedocs.io/en/master/apidoc/KafkaConsumer.html#kafka.KafkaConsumer.committed
[17:41:46] now I am wondering if eventlogging_sync has somehow played a role in this mess
[17:46:11] nope it doesn't make sense
[17:46:20] last entry with rows on db1108 is
[17:46:21] Mar 6 09:37:32 db1108 eventlogging_sync.sh[5051]: 2018-03-06T09:37:32 localhost log MediaViewer_10867062 (rows!) ok
[17:46:53] then (nothing) ok
[17:48:22] elukey: just to be sure, what's the latest timestamp for MediaViewer_10867062 on db1107?
[17:49:10] elukey@db1107:~$ sudo mysql --skip-ssl <<< 'select max(timestamp) from log.MediaViewer_10867062'
[17:49:13] max(timestamp)
[17:49:15] 20180306093821
[17:49:48] hm
[17:50:00] I am not sure if I am missing something trivial
[17:50:05] strange that the committed offset for the consumer is later
[17:50:08] something is def not right
[17:50:14] yeah
[17:50:17] i mean, i guess it shouldn't matter, as we'll reset the offsets anyway
[17:50:24] and reconsume stuf
[17:50:25] stuff
[17:50:33] so as long as the slave is fixed so it will replicate from master properly
[17:50:35] we should be ok
[17:51:50] but I can't explain why the data is weird
[17:52:04] it should have inserted no?
[17:52:55] elukey: what is the latest timestamp in all tables on db1108?
[17:53:02] possible that the tables we are looking at just have very little data?
[17:53:15] hm, no, mediaviewer we know should have a later ts
[17:53:24] the latest seems to be 10:16
[17:53:58] MobileWikiAppOnboarding_9123466
[17:54:26] that is 09:35 on db1107
[17:54:37] v strange
[17:58:11] (PS1) Milimetric: Add pingback dashboard [analytics/dashiki] - https://gerrit.wikimedia.org/r/416732
[17:58:23] (CR) Milimetric: [V: 2 C: 2] Add pingback dashboard [analytics/dashiki] - https://gerrit.wikimedia.org/r/416732 (owner: Milimetric)
[18:03:43] Analytics, EventBus, MediaWiki-JobQueue, Services (done): Failed to acquire page lock in LinksUpdate - https://phabricator.wikimedia.org/T188106#4028144 (Pchelolo) Since the deployment of the fix, this has not happened again even though the rate of refreshLinks jobs is fairly high right now. I'll...
[18:05:25] (CR) Ottomata: Ground sqoop output to DEVNULL (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/415909 (owner: Milimetric)
[18:06:47] ottomata: whenever you have done with lunch etc.. can we have ~30 mins in bc to coordinate the el work?
[18:06:56] *are done
[18:07:33] (CR) Milimetric: "It's only no sqoop output. The script puts out plenty of output, and dry-run would never get any sqoop output anyway. The right way to f" [analytics/refinery] - https://gerrit.wikimedia.org/r/415909 (owner: Milimetric)
[18:09:08] (CR) Ottomata: [C: 1] "Oook!" [analytics/refinery] - https://gerrit.wikimedia.org/r/415909 (owner: Milimetric)
[18:09:21] elukey: sure lets go b
[18:09:22] bc
[18:09:34] ottomata: we can do later on if you want!
[18:09:41] naw i'm all lunched
[18:10:55] milimetric: hey, I don't remember pinging :D I'd love to help to get dashboard in a better design though
[18:11:12] Amir1: oops! wrong amir
[18:11:21] I guessed :D
[18:11:24] ok, Amir1, what are _you_ talking about :)
[18:11:42] mforns: if you want, I can show you around this data
[18:11:57] ok, batcave in 1 minute?
[18:12:02] milimetric, ^
[18:12:03] milimetric: mostly making the wikistats 2.0 align with wikimedia ui design guide
[18:12:13] mforns: bc-2 'cause 1 is taken, but I'll be there
[18:12:22] Analytics-Cluster, Analytics-Kanban, Operations, ops-eqiad, and 2 others: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4028162 (RobH) I'm working on getting these installed today.
[18:12:41] ok
[18:13:17] Amir1: yeah, Volker sent us some thoughts, we'll look at them sometime this summer, not focusing at that level on the UI yet
[18:13:48] milimetric: cool, let me know if I can help with that in the summer
[18:14:14] will do
[18:28:11] Analytics, ChangeProp, EventBus, MediaWiki-JobQueue, and 2 others: Set up grafana alerts for JobQueue-EventBus - https://phabricator.wikimedia.org/T189038#4028220 (Pchelolo) p:Triage>Normal
[18:52:55] Analytics, Collaboration-Team-Triage, EventBus, StructuredDiscussions: Does not appear edits in IRCfeed or Event Stream when non autopatrolled users added comment - https://phabricator.wikimedia.org/T187861#4028403 (Catrope)
[18:59:33] (CR) Joal: [C: 1] "I just reviewed the "mask" code - 1 comment inline about an idea, but sounds good to go as-is :)" (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: Mforns)
[19:06:21] !log cleaned up id=0 rows on db1108 (log database) for T188991
[19:06:23] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:06:24] T188991: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991
[19:07:32] ottomata: db1108 cleaned (the el tables part)
[19:08:30] cool ok
[19:08:44] elukey: needs it for the eventbus _2 tables too
[19:09:35] yep I am checking now
[19:10:11] ok, i'm ready to do commits when you are
[19:11:57] ottomata: done!
[19:13:05] ok elukey bc?
[19:14:49] sure
[19:27:30] Analytics: Mediawiki History: moves counted twice in Revision - https://phabricator.wikimedia.org/T189044#4028787 (Milimetric)
[19:32:25] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are running.
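
[editor's note: the shape of the slave-side cleanup elukey !logs above, combined with his earlier plan (grab the log tables that have a timestamp field, delete everything after the ~09:30 cutoff, let master->slave replication backfill). Illustrative SQL only; the literal timestamp comparison mirrors what was quoted in the channel, and the real cleanup also handled id=0 rows:]

    -- On db1108 (the slave): emit one DELETE per EventLogging table
    -- that has a `timestamp` column.
    SELECT CONCAT('DELETE FROM log.`', table_name,
                  '` WHERE timestamp > 20180306093000;')
    FROM information_schema.columns
    WHERE table_schema = 'log'
      AND column_name = 'timestamp';
    -- Review the generated statements, run them, then re-enable the sync.
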
[19:32:57] elukey: thanks for this work
[19:33:56] nuria_: Andrew is the master behind this recovery, I only executed some deletes :)
[19:34:37] so now we should be ok
[19:34:46] the consumers are happily inserting again
[19:34:54] going to write a summary and then update the analytics@ list
[19:35:08] elukey: ok
[19:40:07] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4028840 (elukey) It seems that we were (kind of) lucky. For some reason that we don't know (it predates most of us), the tables on the slave d...
[19:40:34] nuria_: --^
[19:41:19] Analytics, Collaboration-Team-Triage, EventBus, StructuredDiscussions: Does not appear edits in IRCfeed or Event Stream when non autopatrolled users added comment - https://phabricator.wikimedia.org/T187861#4028847 (jmatazzoni) Hi @Etonkovidova. We came across this in Triage meeting. The decision...
[19:41:27] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4028849 (Ottomata) For posterity, here's the script I used: https://gist.github.com/ottomata/d69ba72313c44e8e45e6453f4ea97074
[19:41:39] elukey: let's set the mysql hostname in puppet to db1107
[19:41:40] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4028858 (elukey)
[19:41:42] so this doesn't happen again
[19:42:27] Analytics-Kanban, Operations: Eventlogging mysql consumers inserted rows on the analytics slave (db1108) for two hours - https://phabricator.wikimedia.org/T188991#4026711 (elukey) Before closing this task: 1) review the m4-master failover policy. 2) document this procedure on wikitech
[19:42:39] ottomata|afk: I added some next steps to the task :)
[19:45:10] all right, I think I can go home now :D
[19:45:16] * elukey off!
[19:45:26] Bye elukey - Thanks for the work!!
[19:46:32] credits to the Kafka Master Andrew, I didn't do much :)
[19:57:19] Indeed - ottomata|afk-San!
[20:06:23] Analytics: Mediawiki History: moves counted twice in Revision - https://phabricator.wikimedia.org/T189044#4028997 (JAllemandou) Super interesting finding !!! Let's discuss what to do with that.
[20:09:40] Analytics: Mediawiki History: moves counted twice in Revision - https://phabricator.wikimedia.org/T189044#4028787 (Nuria) >and so the editor numbers for geowiki will be considerably lower than those from mediawiki_history and Wikistats 2. The number of edits might change but not the number of editors. Right?
[20:09:52] (CR) Mforns: Add EL and whitelist sanitization (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/412939 (https://phabricator.wikimedia.org/T181064) (owner: Mforns)
[20:18:17] Analytics, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136#4029038 (Ottomata)
[20:30:37] Analytics, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136#4029086 (Ottomata)
[20:30:48] Analytics, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Move webrequest varnishkafka and consumers to Kafka jumbo cluster. - https://phabricator.wikimedia.org/T185136#3907314 (Ottomata)
[20:35:16] !log pointing mediawiki monolog kafka producers at kafka jumbo-eqiad cluster: T188136
[20:35:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:35:32] T188136: Migrate Mediawiki Monolog Kafka producer to Kafka Jumbo - https://phabricator.wikimedia.org/T188136
[20:35:42] Analytics: Mediawiki History: moves counted twice in Revision - https://phabricator.wikimedia.org/T189044#4029093 (Milimetric) >>! In T189044#4029016, @Nuria wrote: >>and so the editor numbers for geowiki will be considerably lower than those from mediawiki_history and Wikistats 2. > > The number of edits m...
[20:41:42] Analytics-Kanban, Patch-For-Review: Include X-Client-IP in EventLogging data and geocode during Hive JSON Refinement - https://phabricator.wikimedia.org/T186833#4029101 (Ottomata)
[20:44:08] !log reverted change to point mediawiki monolog kafka producers at kafka jumbo-eqiad until deployment train is done T188136
[20:44:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:44:10] T188136: Migrate Mediawiki Monolog Kafka producer to Kafka Jumbo - https://phabricator.wikimedia.org/T188136
[20:44:11] milimetric: As far as I can tell, the pingback data still hasn't been created. Is that to be expected? I keep refreshing hoping to see it, but so far have been sad.
[20:45:39] CindyCicaleseWMF: I was just looking into it, the job failed with access denied to the db but you seem to have set it up properly
[20:45:58] trying to figure out what's up... it could be because of that outage today; the server has been offline and not accepting queries
[20:46:10] ah, ok - thanks for checking
[20:47:09] Oh, is it using the correct creds file?
[20:47:49] yeah, it's the right creds, I double checked that I can login using them... must be the outage
[20:48:03] it should try again in 15 minutes and I'll double check then
[20:48:48] ah! no, it didn't have access to the creds file, I'll fix it
[20:49:37] I only have access to one of the two creds files (that I know of) on stat1006. I was testing with research-client.cnf. I thought I had changed the patch at one point to use stats-research-client.cnf as was suggested.
[20:49:49] (PS1) Milimetric: Change cred file to the one stats has access to [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/416768
[20:50:01] (CR) Milimetric: [V: 2 C: 2] Change cred file to the one stats has access to [analytics/reportupdater-queries] - https://gerrit.wikimedia.org/r/416768 (owner: Milimetric)
[20:50:25] Thanks! Sorry, I thought I had fixed the patch, but I guess not.
[20:50:52] no problem at all CindyCicaleseWMF, that was interesting, I didn't know we had two files with different access levels
[20:51:48] Ah, I had changed it in patch set 4, but it got changed back in patch set 5. I was testing with the other one, and forgot that when I later changed config.yaml.
[21:06:35] (PS6) Milimetric: [WIP] Compute geowiki statistics for Druid from cu_changes data [analytics/refinery] - https://gerrit.wikimedia.org/r/413265
[21:07:13] I gotta run, doing my pSAT tutoring tonight, see yall tomorrow
[21:47:46] milimetric: has it tried to run again yet? I still don't see any data appearing.
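
[editor's note: the root cause milimetric finds above is a credentials file the job's user couldn't read. A quick hedged way to reproduce that kind of failure; the two .cnf file names come from the discussion, but the directory and the analytics-slave host are assumptions:]

    # Run as the user the reportupdater cron runs under:
    mysql --defaults-file=/etc/mysql/conf.d/stats-research-client.cnf \
          -h analytics-slave.eqiad.wmnet -e 'SELECT 1;'
    # An unreadable defaults file or an "Access denied" error here
    # reproduces the failure the job hit.
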
[22:15:41] Analytics-Cluster, Analytics-Kanban, Operations, ops-eqiad, and 2 others: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4029237 (RobH) Ok, All systems are installed with stretch. I need to reinstall analytics1076, as it had the wrong hostname set by a bad rever...
[22:16:30] Analytics-Cluster, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4029238 (RobH)
[22:16:51] Analytics-Cluster, Analytics-Kanban, Operations, ops-eqiad, User-Elukey: rack/setup/install analytics107[0-7] - https://phabricator.wikimedia.org/T188294#4002756 (RobH) ping @elukey: You can take over on ALL but analytics1076. I need to keep working on it for now.
[22:17:01] Analytics-Kanban, Patch-For-Review: Remove sensitive fields from whitelist for QuickSurvey schemas (end of Q2) - https://phabricator.wikimedia.org/T174386#4029241 (leila) @mforns we're ready to proceed with this task. Can you, for archive happiness, call out which fields will be dropped after the purge?...
[22:17:14] elukey: heyas, if you are about, https://phabricator.wikimedia.org/T188294 is nearly done
[22:17:30] analytics107[0-7] ready to go, except for 1076
[22:21:06] Analytics: Add trash folder to hadoop - https://phabricator.wikimedia.org/T189051#4029254 (Nuria)
[22:25:49] Analytics, Analytics-EventLogging, Analytics-Kanban, Patch-For-Review: Bring back raw user_agent in EventLogging data so we can do further processing in Hadoop - https://phabricator.wikimedia.org/T188673#4029276 (Ottomata) Ah! I was totally wrong! EventLogging's parsed userAgent already has `is...
[22:41:28] Yay!! The pingback data was just produced, and the graphs are all working! They need a bit of tweaking, but this was a huge step forward. Thank you all for your help!
[23:09:20] Analytics-Kanban, Patch-For-Review: Remove sensitive fields from whitelist for QuickSurvey schemas (end of Q2) - https://phabricator.wikimedia.org/T174386#3560006 (Nuria) Ping @fdans to continue with task looks like @leila needs to review the whitelist changes and fields that will be kept
[23:16:44] CindyCicaleseWMF: nice, let me look
[23:31:46] nuria_: I have several questions - possibly enhancement requests. Shall I ask you here?
[23:31:55] CindyCicaleseWMF: sure
[23:33:04] Great! The numbers being represented in the graphs and tables are all counts. Is there a way to make them show up as integers (no trailing .0)?
[23:34:21] And, what is the small box in the bottom left of the graphs for? I tried typing in different numbers, and it appears to make the graphs smoother, which in this case is not really desired behavior. Is there a way to suppress it? Or is it doing something important that I do not understand?
[23:35:22] CindyCicaleseWMF: it is smoothing for points, in this case there are so few points it is not of use. I do not think it is suppressable (?) that seems ticket worthy as we can make it so
[23:35:53] CindyCicaleseWMF: re: integers, i think so
[23:36:12] I'd sort of like to see the actual counts for the numbers over 1,000 rather than XX.Xk - I think. But, as the numbers grow, that will not be good, so I think I can be talked out of that.
[23:36:28] OK, I'll make a ticket for suppressing the box.
[23:37:02] The last one, which is probably a ticket, too: Is there a way to make the header row of the table fixed so it doesn't scroll away as you scroll down the table to see the lower data rows?
[23:39:24] CindyCicaleseWMF: you have the formatter set to kmb
[23:39:27] right?
[23:39:30] CindyCicaleseWMF: https://meta.wikimedia.org/wiki/Config:Dashiki:Pingback
[23:39:32] yes
[23:39:43] CindyCicaleseWMF: so kmb is kilobytes, megabytes
[23:39:44] I originally omitted format, but everything appeared as a percent.
[23:40:09] as far as I could tell, kmb and percent were the only options, and the default appears to be percent
[23:40:17] CindyCicaleseWMF: ah i see, % must be default, that can be changed, one sec
[23:40:46] should I remove the kmb?
[23:43:30] CindyCicaleseWMF: look at this one now: https://pingback.wmflabs.org/#media-wiki-version/media-wiki-version-timeseries
[23:43:41] CindyCicaleseWMF: is this what you mean (y-axis)
[23:44:13] Yes! Exactly!
[23:44:30] ok, see my change on your config
[23:45:03] CindyCicaleseWMF: numeral is used for formatting http://numeraljs.com/
[23:45:14] Cool! It would be great if that were documented :-) Unless I missed it.
[23:45:56] I'll fix the rest of the config.
[23:48:40] Nice! All fixed! Both reportupdater and dashiki are really great tools! I'm amazed at how quickly I was able to get these graphs up!
[23:49:19] CindyCicaleseWMF: updated: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Dashiki#Configuring_the_tabs_layout
[23:50:11] Excellent!
[23:50:48] So, I'll submit a ticket for suppressing the box. Should I submit one for making the table header fixed when the table scrolls as well?
[23:50:57] CindyCicaleseWMF: maybe you can send e-mail to pm lists and spread knowledge, not all data in EL is suitable to be graphed on dashiki but this type of "reporting" data is well suited for it
[23:51:03] CindyCicaleseWMF: sure, yes
[23:51:13] CindyCicaleseWMF: i cannot help you there cause i know no css
[23:52:15] OK, I'll submit the tickets. I'd be happy to email about this. Which lists would you suggest emailing? I'm still learning all of the different communication channels . . .
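
[editor's note: the format values in the dashboard config map to numeral.js patterns (library linked above). A few illustrative calls; the exact pattern dashiki applies for "kmb" is not shown in the log, but numeral's abbreviation format behaves like this:]

    var numeral = require('numeral');

    numeral(12345).format('0,0');   // "12,345" - plain integer counts
    numeral(12345).format('0.0a');  // "12.3k"  - abbreviated, the "XX.Xk" style above
    numeral(0.731).format('0.0%');  // "73.1%"  - the percent default Cindy first saw
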