[07:16:55] hello people!
[07:17:38] if you don't mind I'll do the hadoop master failover to pick up the new jvm
[07:17:56] and probably restart a couple of kafka brokers as well for the same reason
[10:08:45] Analytics-Kanban, DBA, Operations: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294201 (elukey) Hello @Marostegui, thanks a lot for the heads up! I checked `megacli -AdpBbuCmd -a0` again and this is the status: ``` BBU Capacity Info for Adapter: 0 Relative State of Charg...
[10:09:02] Analytics-Kanban, DBA, Operations, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294202 (elukey)
[10:09:13] Analytics-Kanban, DBA, Operations, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3286435 (elukey) a:elukey
[10:10:50] Analytics-Kanban, DBA, Operations, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294205 (Marostegui) Hello, If you are planning to keep that host for a long time (which I assume so) - I would definitely replace the BBU. I think @Cmjohnson might have spares fr...
[10:12:45] Analytics-Kanban, DBA, Operations, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3286435 (jcrespo) Everything you say is correct. We are decommissioning many
Analytics-Kanban, DBA, Operations, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294211 (elukey) Yes let's replace the BBU, will wait for a confirmation from @Cmjohnson then!
[10:13:38] Analytics-Kanban, DBA, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294212 (Marostegui)
[10:48:16] PROBLEM - Hadoop DataNode on analytics1030 is CRITICAL: Return code of 255 is out of bounds
[10:48:46] PROBLEM - Disk space on Hadoop worker on analytics1030 is CRITICAL: Return code of 255 is out of bounds
[10:48:48] PROBLEM - Hadoop NodeManager on analytics1030 is CRITICAL: Return code of 255 is out of bounds
[10:50:46] I am reimaging it --^
[10:52:16] PROBLEM - YARN NodeManager Node-State on analytics1030 is CRITICAL: Return code of 255 is out of bounds
[11:11:16] RECOVERY - YARN NodeManager Node-State on analytics1030 is OK: CRITICAL: YARN NodeManager analytics1030.eqiad.wmnet:8041 Node-State: LOST
[11:11:46] RECOVERY - Disk space on Hadoop worker on analytics1030 is OK: DISK OK
[11:18:39] hello 1030!
[11:19:04] I am currently finishing the work to reimage it
[11:19:16] daemons are masked so yarn/hdfs will not come up
[11:19:26] going to lunch and then I'll put it in production again
[12:00:52] hey fdans
[12:01:23] I'll be in the cave in a bit, just getting some socks
[12:02:20] milimetric: awesome, be there in 2min
[12:11:52] RECOVERY - Hadoop NodeManager on analytics1030 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[12:12:21] RECOVERY - Hadoop DataNode on analytics1030 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[12:38:09] ottomata: hhiiiiiii
[12:38:34] just reimaged an1030 and puppet complained about not finding python-numpy v1.12.0-2~bpo8+1
[12:38:49] in the jessie backports I can see a newer version
[12:39:07] so what I did was grab the version 1.12.0-2~bpo8+1 from apt cache on an1031 and manually install
[12:39:22] but we might want to revise that version in puppet
[12:39:30] package { ['python-numpy', 'python3-numpy']:
[12:39:31] ensure =>
'1:1.12.0-2~bpo8+1',
[12:39:31] }
[12:39:44] !log re-added analytics1030 to the hadoop workers
[12:39:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:45:53] Doing the Hadoop master failover people
[12:45:57] for jvm upgrades
[12:47:14] hIIiI
[12:47:31] HMMMMMM
[12:47:43] elukey: yeah i bet that would be fine
[12:47:47] to bump the version
[12:47:57] i think we added that because whatever was available wasn't high enough
[12:48:00] so something newer should be fine
[12:48:42] we could pin backports with higher priority for python-numpy
[12:49:22] (in the meantime, 1002 is the new hadoop master)
[12:53:46] and 1001 is back the master
[12:54:01] !log restarted master Hadoop daemons for jvm upgrade
[12:54:02] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:57:26] Analytics-Kanban, DBA, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294405 (Ottomata) How soon is likely to happen? Early next FY or later? If within Q1, I'd say let's just wait and replace the box. Otherwise, let's fix the BBU. Eh?
[12:59:11] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Update kafka.sh wrapper script for Kafka 0.10+ - https://phabricator.wikimedia.org/T166164#3287101 (Ottomata)
[12:59:45] Analytics-Kanban, DBA, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294408 (Marostegui) It is probably worth saying that the BBU might have been broken for a long time. We noticed because of the new check, but it would be too much of...
[13:00:50] Analytics-Kanban, DBA, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294409 (jcrespo) I agree with Manuel. while I would like to do the replacement ASAP, in reality it is not going to happen until Q2 or later.
[13:00:54] Analytics-Kanban, DBA, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294410 (Ottomata) > for a long time still. Agree but how long! It is slated for replacement next FY year sometime, right? Maybe we can just do it sooner rather th...
[13:02:14] Analytics-Kanban, DBA, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294411 (jcrespo) The reasoning is that labsdb has priority, and it is even on the best interest of analytics to to that first, if I understood correctly CC @Nuria
[13:06:18] Analytics-Kanban, DBA, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294413 (elukey) @Ottomata if Chris finds a BBU that among the spare parts that we have I'd say that we can do it asap, it should be a relatively painless downtime fo...
[13:08:42] Analytics-Kanban, DBA, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294415 (Marostegui) Another tip, once it is replaced (if it is) try to monitor its temperature once it boots up - in the last few weeks during some server moves we n...
[13:09:05] Analytics-Kanban, DBA, Operations, ops-eqiad, User-Elukey: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3294419 (Ottomata) +1
[15:08:42] Analytics-Kanban: Create yaml UI configuration files for Standard Metrics - https://phabricator.wikimedia.org/T166387#3294599 (Milimetric)
[15:08:44] Analytics-Kanban: Initial Launch of new Wikistats 2.0 website - https://phabricator.wikimedia.org/T160370#3294612 (Milimetric)
[15:10:35] Analytics-Kanban: Create yaml UI configuration files for Wikistats metrics - https://phabricator.wikimedia.org/T166388#3294619 (Milimetric)
[15:10:37] Analytics-Kanban: Initial Launch of new Wikistats 2.0 website - https://phabricator.wikimedia.org/T160370#3096561 (Milimetric)
[15:10:50] Analytics-Kanban: Create yaml UI configuration files for Standard Metrics - https://phabricator.wikimedia.org/T166387#3294635 (Milimetric)
[15:30:58] fyi monday is memorial day holiday
[15:31:30] mforns: o/
[15:31:34] I forgot to ask you something
[15:31:37] elukey, yep!
[15:31:53] say for some reason terminator.purge/sanitize for table X returns a weird exception
[15:32:07] atm we just stop right there the execution of the script
[15:32:11] logging what's happened
[15:32:14] aha
[15:32:50] we only catch db related exceptions in database.execute and return empty result in case
[15:32:54] I think it is acceptable
[15:34:10] mmmm
[15:34:45] elukey, have a meeting now, can I look into this when finished?
[15:35:05] sure
[15:35:17] I'll update the code review so you'll see the test
[15:37:07] elukey, I think errors when purging or updating should be visible enough, so that we analytics can react quickly, no?
[15:37:45] if we catch the SQL errors and just log them...
maybe we overlook some errors, and then we could have sensitive data stored for long time
[15:42:03] mforns: sure, we can think about a way to alarm it
[15:59:05] Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Genericize ca-manager script - https://phabricator.wikimedia.org/T166167#3287261 (Ottomata) a:Ottomata
[16:00:32] elukey, done :]
[16:00:47] finishing the tests, will be ready in ~10 to update
[16:00:56] maybe the easiest way to alarm it would be just letting it die if some error is raised
[16:01:08] k
[16:12:18] milimetric, can I ask you about mediawiki hive data? :]
[16:12:38] if you're having lunch just ignore me
[16:13:02] mforns: this is the test_terminator.py https://gerrit.wikimedia.org/r/#/c/355604/3/eventlogging_cleaner/tests/test_terminator.py
[16:13:05] :D
[16:13:07] still to complete
[16:13:14] hehe ok
[16:13:31] but I found a good skeleton to mock and test calls to the database.execute()
[16:14:17] need to mock sanitize and add few cases
[16:14:22] but overall it seems good
[16:20:49] hello, what would it take to formally request access to web request log analytics?
[16:21:28] we regularly have questions about the usage of specific REST API end points, for which access to the logs would be very helpful
[16:23:28] gwicke: hello! I think it is a regular access request for the analytics private group
[16:23:45] I mean, nothing more than that
[16:24:30] And is there some docs on how to actually run a query on hadoop?
[16:24:37] more specifically - analytics-privatedata-users
[16:24:50] Pchelolo: sure! Let me grab them
[16:24:51] access request as in ticket tagged with access request, to be discussed in the ops meeting?
[16:25:21] no need for the ops meeting, it doesn't contain sudo
[16:25:42] all it takes is analytics approval and manager approval for the requestor :P
[16:25:45] requester
[16:25:54] so a bit of things to read
[16:25:55] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Beeline
[16:26:04] https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Queries
[16:26:27] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic
[16:26:35] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest
[16:26:41] Pchelolo: --^
[16:27:03] Awesome! Thank you :) I have a nice morning read now
[16:27:43] we have also pivot.wikimedia.org, but it contains "only" pageviews and we are experimenting with sampling webrequests in druid
[16:27:54] so you can also check that one
[16:28:17] the only general recommendation is to be "gentle" when issuing queries to big volumes of data
[16:28:22] does it allow filtering by path?
[16:28:34] nope
[16:28:51] for that querying webrequests is the best option
[16:28:57] yeah, that would not be that useful
[16:29:42] gwicke: are you gonna create a ticket for access or I do?
[16:29:46] created https://phabricator.wikimedia.org/T166391
[16:30:58] Analytics, Services (blocked): Analytics access request: pchelolo, mobrovac, gwicke - https://phabricator.wikimedia.org/T166391#3294905 (GWicke)
[16:31:06] elukey: thanks!
[16:31:40] Analytics, Operations, Ops-Access-Requests, Services (blocked): Analytics access request - https://phabricator.wikimedia.org/T166391#3294928 (elukey)
[16:31:50] added the tags --^
[16:31:54] yw!
[16:32:47] elukey, EL cleaner looks really good! Hey, what do you want me to do? comment on the patch right now? wait until you ping me to review? help you writing code? other?
[16:33:56] mforns: nono just check if you see horrible things, it was to give you an idea about the corner cases that I am testing.. should be ready for a big review Monday/Tuesday probably!
[16:34:14] elukey, ok will do :]
[16:34:17] the other thing that I need to figure out is the authentication
[16:34:22] but should be easy
[16:34:23] thanks!
[16:34:25] aha
[16:34:39] yea, I don't know about this...
[16:35:28] mforns: ah and Jaimed told me about https://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlcursor-execute.html
[16:35:41] that it would be better not to pre-format() the queries
[16:35:58] reading
[16:38:01] elukey, so using params={...} instead of string.format()
[16:38:04] ?
[16:38:35] makes sense
[16:42:58] yeah it needs a bit of refactoring but it should be ok..
[16:43:03] I don't think a big win though
[16:43:19] I mean, it is handy that cursor have this feature but...
[16:43:26] anyhow, we'll think about it
[16:43:31] going afk now people!
[16:43:36] have a nice weekend :)
[16:43:55] k, nice weekend elukey!
[16:44:01] o/
[17:00:51] Analytics, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Performance-Team, Spike: Spike: Investigate alternatives to Special:HideBanners cookie storm for cross-domain banner close-button - https://phabricator.wikimedia.org/T117433#3295037 (Krinkle) p:Triage>Normal
[17:03:36] Analytics, Operations, Ops-Access-Requests, Services (blocked): Analytics access request - https://phabricator.wikimedia.org/T166391#3295046 (RobH) a:Nuria>None This seems pretty straightforward, and I'll be on clinic next week, so just commenting the checklist: [x] - all users listed @P...
[17:46:57] mforns: hey, sorry
[17:47:00] what's up
[17:47:06] I was at lunch and then 1/1
[17:47:13] hi milimetric np, totally
[17:47:17] emmmm
[17:48:43] yes, I've seen that mediawiki_ipblocks external table in hive has the data in place of the last import, but not the partition registered in hive metastore, is that expected? do you know sth about this?
[17:50:38] hmmm
[17:51:07] mforns: actually all the other mediawiki tables are gone, like wmf.mediawiki_page and so on
[17:51:16] I noticed that at the hackathon and was confused
[17:51:55] ???
[17:52:00] maybe the tables were destroyed and we never automated creating them?
[17:52:24] milimetric, those are in wmf_raw
[17:52:30] oooh
[17:52:31] doh
[17:53:47] I learned that from the task description, maybe joseph changed the destination of the scoop job?
[17:54:30] mforns: then I'm not sure what you mean, "show partitions mediawiki_ipblocks;" gives me the partitions
[17:54:51] oh I see, none from the 2017-04 snapshot, right?
[17:55:08] anyway, so, if I go to hive and exec: show partitions mediawiki_ipblocks; there's no partition for 2017-04 listed, but if I go to /wmf/data/raw/mediawiki/tables/ipblocks and ls, there's a directory for snapshot=2017-04
[17:55:11] yes
[17:55:15] got it
[17:55:23] then yes, I know nothing about it at all
[17:55:41] I do know he restarted the jobs to re-label them
[17:55:43] k, do you think I can try to msck repair table?
[17:56:03] like it used to be 2017-04 but only include data before 2017-04-01
[17:56:11] yeah, definitely, repair it
[17:56:16] because... this is not expected, right? it should be fixed, no?
[17:56:19] that never hurts as far as I know
[17:56:23] yes, it shouldn't be like this
[17:56:29] k, will try
[17:56:35] it means it's probably breaking the denormalize job
[17:56:36] thanks!
[17:56:42] aha
[17:56:47] yes, makes sense
[17:56:55] but in an ugly way 'cause it just wouldn't have data so we'd never know about it
[17:57:13] good find!
also - let's figure out a way to monitor this, I'll make a task
[17:58:16] Analytics: Monitor if/when mediawiki history reconstruction partitions and imports fall out of sync - https://phabricator.wikimedia.org/T166405#3295286 (Milimetric)
[17:58:34] milimetric, hive (wmf_raw)> msck repair table mediawiki_ipblocks;
[17:58:34] FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
[17:59:04] Analytics, Operations, Ops-Access-Requests, Patch-For-Review, Services (blocked): Analytics access request - https://phabricator.wikimedia.org/T166391#3295298 (Nuria) Approved on my end. I actually though all three had access already.
[17:59:23] wait
[17:59:51] run as hdfs?
[18:01:35] yes did
[18:01:48] tried beeline as well.. but no
[18:19:40] Analytics, Operations, Ops-Access-Requests, Patch-For-Review, Services (blocked): Analytics access request - https://phabricator.wikimedia.org/T166391#3295382 (RobH) a:GWicke @Gwicke: Can you please review and sign the L3 document? Once done, please feel free to assign to me for followup...
[18:22:25] Analytics, Operations, Ops-Access-Requests, Patch-For-Review, Services (blocked): Analytics access request - https://phabricator.wikimedia.org/T166391#3295392 (GWicke) a:GWicke>RobH @RobH, signed it earlier. Thanks for the quick follow-up!
[18:28:20] bearloga: thanks for your contributions to teh tagging!
[18:28:22] *the
[18:28:40] Analytics, Operations, Ops-Access-Requests, Patch-For-Review, Services (blocked): Analytics access request - https://phabricator.wikimedia.org/T166391#3295411 (RobH) Thanks! Pending no objections (and I don't think there will be any), I'll merge this live on Wednesday!
[18:33:06] nuria_: you're welcome! thank you for the work you're doing on this project! it will be really helpful for us!
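The mismatch discussed around 17:55 (a snapshot=2017-04 directory exists under /wmf/data/raw/mediawiki/tables/ipblocks, but "show partitions mediawiki_ipblocks;" does not list it) is the kind of drift the monitoring task asks about. A minimal sketch of such a check follows. It is only an illustration: `unregistered_partitions` is a hypothetical helper, and it assumes the two lists of partition names have already been collected by some other means (for example from the `show partitions` output and from a directory listing).

```python
def unregistered_partitions(registered, on_disk):
    """Return partition names found on disk but missing from the metastore list."""
    return sorted(set(on_disk) - set(registered))


# Illustrative values only; real input would come from Hive and from HDFS.
registered = ["snapshot=2017-02", "snapshot=2017-03"]
on_disk = ["snapshot=2017-02", "snapshot=2017-03", "snapshot=2017-04"]

drift = unregistered_partitions(registered, on_disk)
if drift:
    # A real check would raise an alert here instead of printing.
    print("partitions on disk but not registered:", ", ".join(drift))
```

An empty result means every directory on disk has a matching registered partition; anything else is worth an alert, since downstream jobs would otherwise silently run without that data.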
[19:41:58] Analytics-Kanban: Code Review Needed: New data produced on https://analytics.wikimedia.org/datasets/ - https://phabricator.wikimedia.org/T165944#3295533 (Nuria) Understood, I think all looks good for that data (cc @mforns for 2nd opinion), please be so kind to send us a ticket before data is public next time.
[19:56:48] Analytics-Kanban: Code Review Needed: New data produced on https://analytics.wikimedia.org/datasets/ - https://phabricator.wikimedia.org/T165944#3295544 (mforns) @Nuria Yes, data looks non-sensitive at all. @GoranSMilovanovic I don't know if it fits your use case, but there's this program we Analytics devel...
[20:02:28] Analytics-Kanban: Code Review Needed: New data produced on https://analytics.wikimedia.org/datasets/ - https://phabricator.wikimedia.org/T165944#3295555 (Nuria) Open>Resolved
[21:06:47] Analytics, Performance-Team: Explore NavigationTiming by faceted properties - https://phabricator.wikimedia.org/T166414#3295705 (Gilles)
[22:55:55] HaeB: For the pagetriage tags, it looks like all tags related to the user who created the articles (the ones that start with "user_") are updated every 24 hours by a cron script.
[23:03:13] HaeB: Article-related tags (most everything else) are updated on page saves
[23:16:06] HaeB: Actually, all PageTriage metadata tags are updated on article edit...
[23:16:42] it's just that user-related tags are also updated via cron job in case they change but the article doesn't get edited
[23:56:48] Analytics-Kanban: Code Review Needed: New data produced on https://analytics.wikimedia.org/datasets/ - https://phabricator.wikimedia.org/T165944#3295858 (GoranSMilovanovic) @Nuria Thanks for the intro to reportupdater; I know about it, but I haven't studied it yet. Now is the time. Thanks again, and next tim...
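For the 16:35 to 16:38 exchange about passing params to the cursor's execute() instead of pre-formatting queries with string.format(), here is a small sketch of the idea. It uses Python's stdlib sqlite3 as a stand-in so that it runs anywhere; MySQL Connector/Python takes `%s` or `%(name)s` placeholders where sqlite3 takes `?` or `:name`. The `events` table, the `ts` column and `purge_older_than` are hypothetical names chosen for illustration, not taken from the eventlogging_cleaner patch.

```python
import sqlite3


def purge_older_than(conn, cutoff):
    """Delete rows older than cutoff, passing the value as a bound parameter."""
    cur = conn.cursor()
    # The driver binds the value itself; nothing is formatted into the SQL text.
    cur.execute("DELETE FROM events WHERE ts < :cutoff", {"cutoff": cutoff})
    conn.commit()
    return cur.rowcount


# In-memory database with a few illustrative rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, ts TEXT)")
conn.executemany(
    "INSERT INTO events (ts) VALUES (?)",
    [("20170101000000",), ("20170520000000",), ("20170526000000",)],
)

deleted = purge_older_than(conn, "20170501000000")
remaining = [row[0] for row in conn.execute("SELECT ts FROM events ORDER BY ts")]
print(deleted, remaining)
```

Only values can be bound this way. Table and column names cannot be passed as parameters, so a cleaner that loops over many tables would still have to build that part of the statement itself, ideally from a fixed whitelist.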