[06:08:29] !log upgrade python-kafka on eventlog1002 to 1.4.7-1 (manually via dpkg -i)
[06:08:31] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[06:08:32] morning!
[06:18:17] eventlogging seems to be running fine with the new kafka-python
[06:18:25] I really hope that this time is the right one
[06:18:35] if so it would be a big relief
[06:20:00] Analytics, Analytics-Kanban, Performance-Team (Radar), User-Elukey: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances - https://phabricator.wikimedia.org/T222941 (elukey) a:elukey
[06:20:54] Analytics, Analytics-Kanban: Migrate eventlogging to python3 - https://phabricator.wikimedia.org/T234593 (elukey)
[06:20:59] Analytics, Analytics-Kanban, Patch-For-Review: Upgrade eventlogging to Python 3 - https://phabricator.wikimedia.org/T233231 (elukey)
[07:10:21] Hi team - reminder that I'm off today and tomorrow
[07:24:12] o/
[07:24:19] * elukey sends wikilove to joal
[09:00:07] what about https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/518210/ ? is that still needed? it was originally meant to track down some issues with opencv, but for the current stat hosts with the GPU it's probably no longer a thing?
[09:01:42] elukey@stat1005:~$ /usr/sbin/radeontop
[09:01:42] Cannot access GPU registers, are you root?
[09:02:12] we have metrics in prometheus now, in theory, but they are probably not as precise as radeontop for these things
[09:03:32] yeah, but the question is whether this is necessary for the gpu-testers group or whether it's enough if you can use radeontop to troubleshoot
[09:05:00] no idea, but radeontop is surely a good hel
[09:05:02] *help
[09:05:57] there is also /opt/rocm/bin/rocm-smi
[09:05:58] I'm fine either way, just wondering whether to discard or merge :-)
[09:06:36] I would merge it so users will have a usable tool
[09:12:54] ok
[10:17:29] Analytics, Analytics-Kanban, Patch-For-Review, Performance-Team (Radar), User-Elukey: Eventlogging processors are frequently failing heartbeats causing consumer group rebalances - https://phabricator.wikimedia.org/T222941 (elukey) Metrics to check for the next ~24h: * https://grafana.wikimed...
[10:22:56] Analytics, Performance-Team: Upgrade python-kafka to 1.4.7 - https://phabricator.wikimedia.org/T234808 (elukey) p:Triage→Normal
[10:24:46] Analytics, Performance-Team: Upgrade python-kafka to 1.4.7 - https://phabricator.wikimedia.org/T234808 (elukey)
[10:47:07] hi a-team, I received an email asking for our search query logs (https://blog.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedia-search-data-now-available/). I see that that dump was not finally released, but I found this one: https://dumps.wikimedia.org/other/cirrussearch , is that equivalent?
[10:57:19] dsaez: o/ no idea, probably you'll need to wait for Andrew/Dan (Joseph is out today)
[10:57:28] but maybe dcausse has some moar info? :)
[10:58:10] (/me lunch!)
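For context on the kafka-python bump logged above: the rebalance problem tracked in T222941 revolves around consumer heartbeats and poll intervals, which kafka-python exposes as KafkaConsumer settings. A minimal sketch with those knobs spelled out, not the actual eventlogging processor code; the topic, broker, and group names are illustrative placeholders, and the values shown are the library defaults:

    import kafka
    from kafka import KafkaConsumer

    # Sanity check after a manual dpkg install: which kafka-python this interpreter sees.
    print(kafka.__version__)  # expecting 1.4.7

    consumer = KafkaConsumer(
        'eventlogging-client-side',                            # placeholder topic
        bootstrap_servers='kafka-jumbo1001.eqiad.wmnet:9092',  # placeholder broker
        group_id='eventlogging_processor_example',             # placeholder group
        # The settings the rebalance issue revolves around: a background thread sends
        # heartbeats every heartbeat_interval_ms, the broker evicts the member if none
        # arrive within session_timeout_ms, and max_poll_interval_ms bounds the time
        # allowed between poll() calls before the group rebalances.
        heartbeat_interval_ms=3000,
        session_timeout_ms=10000,
        max_poll_interval_ms=300000,
    )

    for message in consumer:
        print(message.topic, message.partition, message.offset)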
[10:59:12] ook
[10:59:35] thx elukey
[11:12:33] a-team the mediarequests endpoints are live :)
[11:12:33] https://wikimedia.org/api/rest_v1/metrics/mediarequests/aggregate/all-referers/all-media-types/all-agents/daily/20190501/20191001
[11:38:05] (PS3) Fdans: Add mediarequests tops metric endpoint [analytics/aqs] - https://gerrit.wikimedia.org/r/540433
[11:53:47] dsaez: the search query logs are not public; they were released for a few hours (the blog post you mention) but rapidly taken down, see the note at the end of that same post
[11:54:15] it was taken down because search queries may contain private information
[11:54:43] dcausse, thx, and the cirrussearch dumps, are they equivalent or is this a completely different story?
[11:55:05] https://dumps.wikimedia.org/other/cirrussearch is the dump of the search content we index, in other words the wiki page content + the metadata we index
[11:55:49] dsaez: if you want to inspect search queries you need to do this on the analytics cluster
[11:58:39] dcausse, oh, I see, this was for an external researcher asking for the dataset. He wants to propose a research project around the queries. I suggested he contact Guillaume to see if there is some common interest there, is Guillaume the right person to be contacted?
[12:00:13] dsaez: the researcher will have to ask for an NDA, this data is not available publicly; yes, Guillaume might be the right person even if we don't have much experience dealing with external researchers
[12:01:19] dcausse, great, let's see if there is some common interest there.
[12:01:27] sure
[12:55:24] thanks dcausse!
[12:56:02] Analytics, Performance-Team: Upgrade python-kafka to 1.4.7 - https://phabricator.wikimedia.org/T234808 (elukey) ` root@install1002:/srv/wikimedia# reprepro lsbycomponent python-kafka python-kafka | 1.4.3-1~jessie1 | jessie-wikimedia | main | amd64, i386, source python-kafka | 1.4.7-1 | stretch-w...
[12:57:20] dsaez: https://blog.wikimedia.org/2012/09/19/what-are-readers-looking-for-wikipedia-search-data-now-available/
[12:57:21] :p
[12:59:19] OH sorry
[12:59:22] shoulda scrolled up farther
[12:59:27] :p you found that already
[13:17:35] Analytics, Analytics-Kanban, User-Elukey: Shorten the time it takes to move files from hadoop to dump hosts by Kerberizing/hadooping the dump hosts - https://phabricator.wikimedia.org/T234229 (elukey)
[13:17:37] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Allow all Analytics tools to work with Kerberos auth - https://phabricator.wikimedia.org/T226698 (elukey)
[13:23:30] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade eventlogging to Python 3 - https://phabricator.wikimedia.org/T233231 (elukey)
[13:23:40] fdans: missed your announcement, congrats :) \o/
[13:24:39] Analytics, Analytics-Kanban: Eventlogging to druid daily timers not executing? - https://phabricator.wikimedia.org/T234494 (elukey) @Nuria should we do something for this or should we close?
[13:29:58] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade eventlogging to Python 3 - https://phabricator.wikimedia.org/T233231 (Ottomata) Luca, if you have trouble migrating eventlogging/service.py stuff, we can probably just remove it. We've removed all eventlogging-service-eventbus i...
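A quick way to poke at the new mediarequests endpoint, using the exact URL posted above; a minimal sketch assuming the usual AQS response shape (an "items" list in the JSON body), with a placeholder User-Agent:

    import requests

    url = (
        "https://wikimedia.org/api/rest_v1/metrics/mediarequests/aggregate/"
        "all-referers/all-media-types/all-agents/daily/20190501/20191001"
    )
    # Identify the client; a descriptive User-Agent is expected for API traffic.
    resp = requests.get(url, headers={"User-Agent": "mediarequests-example/0.1"})
    resp.raise_for_status()

    data = resp.json()
    # Print the first few daily datapoints.
    for item in data.get("items", [])[:5]:
        print(item)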
[13:37:09] Analytics, DBA: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (elukey)
[13:44:27] Analytics, DBA: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (Marostegui) I am fine with this plan. I assume this service will be still owned and maintained by Analytics, right? (Of course we can help with the setup and all that as we normally do). What I w...
[13:50:14] Analytics, DBA: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (elukey) >>! In T234826#5552110, @Marostegui wrote: > I am fine with this plan. I assume this service will be still owned and maintained by Analytics, right? (Of course we can help with the setup...
[13:51:45] Analytics, DBA: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (Marostegui) >>! In T234826#5552125, @elukey wrote: >>>! In T234826#5552110, @Marostegui wrote: >> I am fine with this plan. I assume this service will be still owned and maintained by Analytics,...
[13:55:02] Analytics, DBA: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (elukey) >>! In T234826#5552128, @Marostegui wrote: >>>> Important note about the log database: the plan is to take a full snapshot of the db and archive it in HDFS before starting any procedure....
[13:56:34] hey teammm
[13:57:35] o/
[14:01:57] Analytics, DBA: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (Marostegui) Sure, I just wanted to make sure expectations for the users will be handled beforehand :)
[14:04:35] (CR) Fdans: [V: +1] "Both monthly and daily jobs have been correctly retested with the latest changes applied." [analytics/refinery] - https://gerrit.wikimedia.org/r/538880 (https://phabricator.wikimedia.org/T233717) (owner: Fdans)
[14:13:22] Analytics, Analytics-Kanban, Patch-For-Review, User-Elukey: Upgrade eventlogging to Python 3 - https://phabricator.wikimedia.org/T233231 (elukey) >>! In T233231#5552065, @Ottomata wrote: > Luca, if you have trouble migrating eventlogging/service.py stuff, we can probably just remove it. We've re...
[15:01:29] dsaez, dcausse: but we do not normally grant access to data for a researcher doing a project on their own that is also not part of WMF projects. we can only support so many users
[15:01:43] nuria, got you.
[15:01:46] nuria: sure
[15:03:43] dsaez, dcausse: so you know, we get about a request every couple of weeks from researchers looking for data for their phd, research .. etc
[15:04:05] last year we investigated collaborating with a research lab but realized that we may lack bandwidth to properly support the project
[15:04:23] nuria: I see
[15:04:51] dcausse: right, that is commonly the problem
[15:16:54] a-team https://usercontent.irccloud-cdn.com/file/avHGrkmZ/Screen%20Shot%202019-10-07%20at%205.16.38%20PM.png
[15:22:45] Analytics: Superset not able to load a reading dashboard - https://phabricator.wikimedia.org/T234684 (elukey) >>! In T234684#5548659, @Nuria wrote: > I see errors reaching to 1001 but i thought it was 1003 the one that answered to superset? Yep the broker used is on 1003, but the errors refer to druid1001:
[15:25:44] PROBLEM - Check the last execution of reportupdater-published_cx2_translations on stat1006 is CRITICAL: CRITICAL: Status of the systemd unit reportupdater-published_cx2_translations https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[15:27:24] Analytics, DBA: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (Milimetric) p:Triage→High
[15:27:32] Analytics, DBA: Repurpose db1108 as generic Analytics db replica - https://phabricator.wikimedia.org/T234826 (Milimetric) p:High→Normal
[15:28:16] Analytics, Analytics-Kanban: Superset not able to load a reading dashboard - https://phabricator.wikimedia.org/T234684 (Milimetric) p:Triage→High a:Milimetric
[15:29:19] Analytics, Analytics-Kanban: Upgrade matomo to its latest upstream version - https://phabricator.wikimedia.org/T234607 (Milimetric) p:Triage→High
[15:29:27] Analytics, Analytics-Kanban: Superset not able to load a reading dashboard - https://phabricator.wikimedia.org/T234684 (Milimetric) a:Milimetric→Nuria
[15:33:42] Analytics, Analytics-EventLogging, Better Use Of Data, EventBus, and 2 others: Eventlogging Client Side can use the stream config module to dynamically adjust sampling rates - https://phabricator.wikimedia.org/T234594 (Milimetric) p:Triage→High
[17:12:49] ottomata: going off now, python-kafka seems to behave for the moment.. in case of rollback, 1.4.1 is in my home dir on eventlog1002
[17:13:05] o/
[17:13:24] great ojk!
[17:19:25] nuria, milimetric, this is what I could compile re. our discussions about data quality (phabricator, google doc, my notes):
[17:20:40] - We discarded prometheus because it does not support timestamps. As our readings will be measured with a couple hours of delay (after refine etc..)
[17:21:13] - We discarded statsd, for the same reason, but as we do not need minutely aggregation, we can skip statsd and push directly to graphite
[17:22:31] - We considered graphite as a good option, but then after our discussion on the 27th of june, we dropped it in favor of the dashiki solution.
[17:23:07] - I believe it was because graphite is deprecated by ops, likely to be there still for a couple years, but in disuse
[17:23:40] PROBLEM - Check the last execution of reportupdater-pingback on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:24:02] - Also, because with graphite, everytime we add a new data quality metric, we'd need to manually configure the corresponding holt_winters alarm
[17:24:41] whereas if we did it in spark, we wouldn't need to do it for each new metric
[17:24:57] ok, checking RU alarms
[17:25:08] PROBLEM - Check the last execution of reportupdater-browser on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
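On the data quality notes above: the reason pushing directly to graphite works for delayed readings is that its plaintext protocol carries an explicit timestamp, a "metric value timestamp" line sent to TCP port 2003 lands in the bucket for that timestamp no matter when it is sent. A minimal sketch; the host and metric names are placeholders, not a real ingest endpoint:

    import socket
    import time

    GRAPHITE_HOST = "graphite.example.org"  # placeholder host
    GRAPHITE_PORT = 2003                    # plaintext protocol port

    def send_metric(name, value, timestamp=None):
        # "metric value timestamp\n" -- the explicit timestamp lets a reading
        # computed hours after the fact (post-refine) land in the right bucket.
        line = "%s %s %d\n" % (name, value, int(timestamp or time.time()))
        with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5) as sock:
            sock.sendall(line.encode("utf-8"))

    # e.g. a data quality reading for an hour that was only refined two hours later:
    send_metric("analytics.data_quality.example_metric", 0.97, time.time() - 2 * 3600)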
[17:26:18] here i thought my network was flaky, stat1007 seems to have stalled out. ssh connections are hung, new connections aren't being made
[17:26:38] PROBLEM - Check the last execution of archive-maxmind-geoip-database on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:31:48] PROBLEM - Check the last execution of refinery-import-page-history-dumps on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:32:10] PROBLEM - Check the last execution of reportupdater-interlanguage on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:32:29] ebernhardson: ya stat1007 is having issues
[17:32:44] needs cgroups to limit people taking all the memory and killing core services :P
[17:32:53] (i dunno how to actually implement that...)
[17:33:00] ebernhardson: could not agree nore myself
[17:33:03] *more
[17:33:12] ebernhardson: we could start with ulimits
[17:34:09] ping ottomata on stat1007 going kaput
[17:39:32] ah sorry am lunching
[17:39:34] hmm
[17:39:37] somebody needs to use the hammer
[17:40:16] looking
[17:40:42] PROBLEM - Check if the Hadoop HDFS Fuse mountpoint is readable on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs
[17:46:03] !log powercycling stat1007
[17:46:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:46:59] ottomata: can we possibly look at the process most recently executed?
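On the cgroups/ulimits idea above: host-level enforcement aside, a per-process stopgap a user can put at the top of their own script is an address-space rlimit, so a runaway job fails with a MemoryError instead of starving nrpe and sshd the way this incident did. A sketch only; the 8 GiB cap is an arbitrary example value:

    import resource

    # Cap this process's virtual address space; allocations beyond the soft
    # limit fail, which Python surfaces as MemoryError.
    limit_bytes = 8 * 1024 ** 3  # arbitrary example cap
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))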
[17:48:12] ebernhardson: question about the old times
[17:48:29] ebernhardson: the possible domains of the wikipedia portal are www.wikipedia.org and anything else?
[17:49:38] nuria: hmm, i think that's it
[17:49:44] ebernhardson: k
[17:53:04] RECOVERY - Check the last execution of refinery-import-page-history-dumps on stat1007 is OK: OK: Status of the systemd unit refinery-import-page-history-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:53:26] RECOVERY - Check the last execution of reportupdater-interlanguage on stat1007 is OK: OK: Status of the systemd unit reportupdater-interlanguage https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:54:06] def some OOM error, last syslogs are about nrpe not being able to allocate mem
[17:55:32] RECOVERY - Check the last execution of reportupdater-pingback on stat1007 is OK: OK: Status of the systemd unit reportupdater-pingback https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:57:00] RECOVERY - Check the last execution of reportupdater-browser on stat1007 is OK: OK: Status of the systemd unit reportupdater-browser https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:57:22] RECOVERY - Check the last execution of reportupdater-published_cx2_translations on stat1007 is OK: OK: Status of the systemd unit reportupdater-published_cx2_translations https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[17:58:32] RECOVERY - Check the last execution of archive-maxmind-geoip-database on stat1007 is OK: OK: Status of the systemd unit archive-maxmind-geoip-database https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:07:55] (PS1) Nuria: Adding entry so www.wikipedia.org data is refined [analytics/refinery/source] - https://gerrit.wikimedia.org/r/541336 (https://phabricator.wikimedia.org/T234461)
[18:09:40] PROBLEM - Check the last execution of reportupdater-published_cx2_translations on stat1007 is CRITICAL: CRITICAL: Status of the systemd unit reportupdater-published_cx2_translations https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:11:20] RECOVERY - Check if the Hadoop HDFS Fuse mountpoint is readable on stat1007 is OK: OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration%23Fixing_HDFS_mount_at_/mnt/hdfs
[18:15:20] (CR) Mforns: [V: +2 C: +2] "LGTM!" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/541336 (https://phabricator.wikimedia.org/T234461) (owner: Nuria)
[18:21:14] nuria, milimetric, do the notes above re. data quality make sense, can I continue with the Scala-Spark job to generate Dashiki report files, or should we rediscuss?
[18:22:16] makes sense to me, and it’s fairly easy to change where we do the dashboard so I’d say go ahead but I wasn’t there in the first place
[18:26:21] milimetric, yes, you were, no? the first discussion was the 27th of June
[18:29:47] oh I totally don’t remember that :) but must’ve been yeah
[18:48:46] mforns: where does the data harvested by prometheus go now, graphite or something else?
[18:52:53] nuria, prometheus is its own thing
[18:53:14] nuria, IIUC grafana reads from prometheus
[18:53:39] but the data is stored by prometheus itself
[18:55:15] GoranSM: I see that you are executing a process on stat1007 consuming quite a lot of memory, could it be run with ionice/nice? that might not impact memory that much but it will reserve cpu cycles for others
[19:06:23] quick question, is there a table with the number of revisions per user? or is this somewhere in the data lake?
[19:13:47] dsaez, the mediawiki_history table has a field named event_user_revision_count that might be useful
[19:14:24] thanks mforns, I saw that, but I'm not sure what this means: (only available in revision-create events so far)
[19:14:33] however, you still would have to select the pool of users you want to analyze
[19:14:34] not sure what a revision-create event is :D
[19:14:48] a revision create is basically an edit
[19:14:54] ooh, ok.
[19:15:19] thanks!
[19:15:59] you could maybe select the last revision create event per user?
[19:16:06] no problemo
[19:16:11] :]
[19:22:54] yep, I think that would be faster than counting each revision
[19:57:52] Analytics, Analytics-Kanban, Performance-Team (Radar): Upgrade python-kafka to 1.4.7 - https://phabricator.wikimedia.org/T234808 (Gilles)
[19:58:18] milimetric: any tips on getting mw-vagrant mediawiki set up to send something with eventlogging?
[19:58:29] maybe with the WikimediaEvents extension?
[20:20:30] Analytics, Product-Analytics, Growth-Team (Current Sprint): Remove the HelpPanel schema from the EventLogging whitelist - https://phabricator.wikimedia.org/T234855 (nettrom_WMF)
[20:23:40] Analytics, Product-Analytics, Growth-Team (Current Sprint): Remove the HelpPanel schema from the EventLogging whitelist - https://phabricator.wikimedia.org/T234855 (nettrom_WMF)
[20:24:09] Analytics, Product-Analytics, Growth-Team (Current Sprint): Remove the HelpPanel schema from the EventLogging whitelist - https://phabricator.wikimedia.org/T234855 (nettrom_WMF) p:Triage→High
[21:01:22] ottomata: a mouse over (popup) should always send an event
[21:01:28] ottomata: did you try that one?
[21:01:52] nuria: is that in core?
[21:01:55] or do I need an extension?
[21:02:00] ottomata: core
[21:02:07] ottomata: at least i think
[21:03:10] Analytics, Analytics-Cluster, DC-Ops, Operations, ops-eqiad: analytics1045 - RAID failure and /var/lib/hadoop/data/j can't be mounted - https://phabricator.wikimedia.org/T232069 (wiki_willy) Thanks @elukey . Should we ignore/resolve this alert then? Thanks, Willy
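Following the mediawiki_history exchange above (reading the event_user_revision_count counter off each user's last revision-create event instead of counting rows), a rough PySpark sketch of that approach; the snapshot value and the enwiki filter are placeholders chosen for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("edits-per-user-example").getOrCreate()

    # event_user_revision_count is a running per-user counter on revision-create
    # events, so its max per user is the count as of that user's latest edit.
    edits_per_user = spark.sql("""
        SELECT event_user_id,
               MAX(event_user_revision_count) AS revision_count
        FROM wmf.mediawiki_history
        WHERE snapshot = '2019-09'        -- placeholder: use the latest snapshot
          AND event_entity = 'revision'
          AND event_type = 'create'
          AND wiki_db = 'enwiki'          -- placeholder wiki
          AND event_user_id IS NOT NULL
        GROUP BY event_user_id
    """)

    edits_per_user.show(10)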
[21:05:57] ottomata: also i have (in the past) added the NavigationTiming extension and set the sampling factor to 1
[21:06:47] ottomata: $wgNavigationTimingSamplingFactor = 1; in a/LocalSettings.php b/LocalSettings.php
[21:11:10] oh great i see a vagrant role for that
[21:11:11] ty
[21:32:30] (CR) Nuria: "One nit in comments but I think is ready" (2 comments) [analytics/aqs] - https://gerrit.wikimedia.org/r/540433 (owner: Fdans)
[22:26:25] PROBLEM - Check the last execution of refinery-import-page-history-dumps on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:26:33] PROBLEM - Check the last execution of archive-maxmind-geoip-database on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:26:49] PROBLEM - Check the last execution of reportupdater-interlanguage on stat1007 is CRITICAL: connect to address 10.64.21.118 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:37:01] RECOVERY - Check the last execution of refinery-import-page-history-dumps on stat1007 is OK: OK: Status of the systemd unit refinery-import-page-history-dumps https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:37:07] RECOVERY - Check the last execution of archive-maxmind-geoip-database on stat1007 is OK: OK: Status of the systemd unit archive-maxmind-geoip-database https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:37:25] RECOVERY - Check the last execution of reportupdater-interlanguage on stat1007 is OK: OK: Status of the systemd unit reportupdater-interlanguage https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:40:02] stat1007 went kaput again but looks like it resurrected