[02:24:44] greetings!
[02:34:49] hi groceryheist !
[05:10:14] 10Analytics, 10Fundraising-Backlog: Discrepancy between HUE query in Count of event.impression and turnillo - https://phabricator.wikimedia.org/T204396 (10Jseddon)
[05:11:41] 10Analytics, 10Fundraising-Backlog: Identify source of discrepancy between HUE query in Count of event.impression and druid queries via turnilo/superset - https://phabricator.wikimedia.org/T204396 (10Jseddon)
[10:06:28] PROBLEM - Check the last execution of check_webrequest_partitions on analytics1003 is CRITICAL: CRITICAL: Status of the systemd unit check_webrequest_partitions
[11:28:40] a-team - Looks like the webrequest_misc caches have been totally depooled (or some other term I don't know), meaning that absolutely no data is flowing anymore on the webrequest_misc kafka topic
[11:29:16] This leads to a bunch of errors from camus_checker, oozie and check_webrequest_partitions
[11:34:20] the biggest issue I see here is that we have some jobs depending on misc data to start (namely webrequest-druid daily and hourly and webrequest sample)
[11:35:34] The easiest solution, I think, is to merge elukey's patch removing misc, deploy the cluster and restart the needed jobs
[11:37:10] doing this now
[11:39:02] (03CR) 10Joal: [V: 032 C: 032] "Merging for emergency deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/459827 (https://phabricator.wikimedia.org/T200822) (owner: 10Elukey)
[11:39:51] !log Deploying refinery with scap
[11:39:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:00:33] hello joal!
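The alert that fired here watches for exactly this symptom: an expected webrequest source going silent on its Kafka topic. A minimal hypothetical sketch of that kind of completeness check (topic names and message counts are made up for illustration; this is not the real check_webrequest_partitions logic):

```python
# Hypothetical sketch: flag webrequest sources whose Kafka topic has
# stopped receiving messages. Counts are illustrative, not real data.
def silent_sources(message_counts, expected_sources):
    """Return the expected sources whose topic received no messages."""
    return sorted(
        src for src in expected_sources
        if message_counts.get("webrequest_" + src, 0) == 0
    )

counts = {"webrequest_text": 981234, "webrequest_upload": 755321, "webrequest_misc": 0}
print(silent_sources(counts, ["text", "upload", "misc"]))  # ['misc']
```

In the incident above, misc going to zero was legitimate (the hosts were reimaged away), which is why the fix was to remove misc from the jobs rather than to chase missing data.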
[12:00:42] I just arrived in MPX and wanted to do the same :)
[12:00:42] !log Deploying refinery onto hadoop :)
[12:00:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:00:48] you are awesome
[12:00:48] Hi elukey :)
[12:00:52] no prob
[12:01:06] elukey: Just saw the mess and couldn't leave it this way ;)
[12:01:12] yesterday e*ma reimaged all the misc varnish hosts
[12:01:24] elukey: I would have guessed something like that
[12:01:25] so no more traffic coming from them (it was only health checks but it was data)
[12:01:36] good thing is that the alarm works! \o/
[12:01:48] elukey: indeed /o\
[12:01:51] :)
[12:01:51] ahahah
[12:01:58] anything that I can do to help?
[12:02:28] elukey: so far so good - I was gently praying for the scap deploy not to fail because of drive space :)
[12:02:43] but now that I know you're around, anything can happen, I feel safe :)
[12:02:52] super
[12:03:13] elukey: I also dropped a line on the ops chan to let them know the deploy was exceptional
[12:03:13] I wanted to do this with you in NYC to refresh my knowledge about restarting bundles etc..
[12:03:22] I guess that I'll only do a dry run :)
[12:03:33] elukey: we'll be able to do that
[12:03:49] elukey: the nice thing here is that we remove stuff, we don't change
[12:04:08] Imagine if misc had changed in a non-backward-compatible way and had failed for a few days
[12:04:44] We would have had to restart the bundle with the current date (to prevent re-running refine of text and upload), and manually create a coordinator to backfill the missing misc
[12:04:50] I'm actually glad we remove :)
[12:05:29] :)
[12:05:30] elukey@analytics1003:~$ ls -l /srv/deployment/analytics/refinery/bin/refinery-drop-druid-snapshots
[12:05:33] -rw-r--r-- 1 analytics analytics 3773 Sep 15 11:52 /srv/deployment/analytics/refinery/bin/refinery-drop-druid-snapshots
[12:05:56] :(
[12:06:01] this is sad indeed
[12:06:49] ah wait, it doesn't have the execute perms
[12:07:59] was it added/changed recently?
[12:08:05] yup
[12:08:08] changed
[12:08:35] ah ok so we just need to change it in refinery
[12:08:39] yes
[12:08:40] too bad that you just deployed
[12:08:50] well I can just chmod now
[12:08:57] ok refinery deployed - Restarting needed jobs
[12:11:28] !log Killing and restarting webrequest-load-bundle
[12:11:30] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:13:37] so after the chmode
[12:13:40] *chmod
[12:13:40] hdfs@analytics1003:~$ /srv/deployment/analytics/refinery/bin/refinery-drop-druid-snapshots -d mediawiki_history_reduced -t druid1004.eqiad.wmnet:8081 -s -f /var/log/refinery/drop-druid-public-snapshots.log
[12:13:44] Usage: refinery-drop-druid-snapshots [options]
[12:13:47] the -s doesn't have any parameter
[12:13:58] ???
[12:14:01] Meh
[12:14:10] ah found it, sending code review
[12:16:03] joal: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/460697/1/modules/profile/manifests/analytics/refinery/job/data_purge.pp
[12:16:18] elukey: Have we killed WDQS job already?
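The "Usage:" bail-out above is the classic symptom of passing an option that requires a value with no value behind it (here, `-s` immediately followed by `-f`, which is why the puppet crontab needed fixing). A small sketch of the failure mode, assuming an argparse-style CLI; the option names mimic the invocation in the log but are illustrative, not the script's actual interface:

```python
import argparse

# Sketch: '-s' is declared to take a value, so a bare '-s' followed by
# another option makes the parser print usage and exit, as seen above.
# Option names are illustrative, not refinery-drop-druid-snapshots's real CLI.
parser = argparse.ArgumentParser(prog="refinery-drop-druid-snapshots")
parser.add_argument("-d", dest="datasource")
parser.add_argument("-t", dest="druid_host")
parser.add_argument("-s", dest="snapshots_to_keep")  # requires a value
parser.add_argument("-f", dest="log_file")

try:
    # '-f' is a known option, so it cannot be consumed as the value of '-s':
    # argparse reports "expected one argument" and raises SystemExit.
    parser.parse_args(["-d", "mediawiki_history_reduced", "-s", "-f", "/tmp/x.log"])
    usage_error = False
except SystemExit:
    usage_error = True
print(usage_error)  # True
```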
[12:18:19] nope
[12:18:33] elukey: looks like so, not in hue
[12:18:57] I haven't done it :)
[12:19:06] joal: ah you already fixed the permissions!
[12:19:22] elukey: hm - Nope, didn't do it :)
[12:19:52] ah no sorry, misread the commit log
[12:19:56] elukey: Ah ! Found it - hue other page (this is tricky)
[12:20:11] mmm weird, in my refinery the bin/ etc.. all have go+x
[12:20:38] !log Kill wikidata-wdqs coordinator
[12:20:40] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:20:44] ok in the meantime, new crontab deployed
[12:20:53] \o/
[12:22:15] !log Restart webrequest-druid-[hourly|daily] coordinators
[12:22:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:22:49] ah no ok, the perms are definitely missing
[12:22:55] but git diff doesn't show me anything
[12:23:08] and I have filemode=true
[12:23:15] wow - this is error-prone
[12:25:39] joal: shall we start the cron in a tmux?
[12:26:24] elukey: could be a good idea - Job shouldn't be long, but who knows
[12:26:37] all right doing it
[12:26:47] by the way elukey - When you say cron, you mean a manual run of the failed cron job, right?
[12:28:00] 2018-09-15T12:27:47 ERROR URLError = unknown url type: druid1004.eqiad.wmnet
[12:28:03] yeah
[12:28:21] TypeError: the JSON object must be str, not 'NoneType'
[12:28:33] nicer and nicer every time
[12:30:27] going closer to my gate to check if they are boarding :)
[12:30:30] brb
[12:34:46] back
[12:42:30] ok so if I add "http://" in front of druid1004 etc.. it leads to
[12:42:35] TypeError: the JSON object must be str, not 'bytes'
[12:42:43] ???
[12:42:46] Meh
[12:42:59] that is a unicode error for sure, so in there it needs .decode('utf-8')
[12:43:00] elukey: python version?
[12:43:04] python 3
[12:43:28] 3.4.2
[12:43:30] on an1003
[12:43:40] elukey: code is meant to run with python 2 I think
[12:43:55] no >
[12:43:56] ?
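Both errors in the exchange above are Python 3 behaviors: urllib rejects a URL with no scheme ("unknown url type"), and on Python 3.4 (the version on an1003 per the log) `json.loads` only accepts `str`, so an HTTP response body read as bytes must be decoded first. A minimal reproduction with an illustrative payload (no real druid call is made):

```python
import json
from urllib.parse import urlparse

# On Python 3.4, json.loads rejects bytes with "the JSON object must be
# str, not 'bytes'"; decoding the response body first fixes it.
# (bytes input is only accepted from Python 3.6 onward.)
raw_body = b'{"datasource": "mediawiki_history_reduced"}'
data = json.loads(raw_body.decode("utf-8"))
print(data["datasource"])  # mediawiki_history_reduced

# A bare "host:port" parses as if the hostname were the URL scheme, which
# is why urlopen reported "unknown url type: druid1004.eqiad.wmnet";
# prefixing "http://" yields a well-formed URL.
fixed = "http://" + "druid1004.eqiad.wmnet:8081"
print(urlparse(fixed).netloc)  # druid1004.eqiad.wmnet:8081
```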
[12:44:15] on top of the file I see
[12:44:15] #!/usr/bin/env python3
[12:44:15] # -*- coding: utf-8 -*-
[12:44:24] Wrong file on my side then - sorry
[12:45:34] elukey: No emergency on that one though - Let's solve that on Monday :)
[12:45:56] sure sure, I got nerd sniped :)
[12:46:08] elukey: also, can you patch the camus_checker whitelist to prevent alarming on missing data for webrequest_misc?
[12:46:10] shall I send an email to recap?
[12:46:19] sure I can
[12:46:21] elukey: doing so about misc
[12:46:31] ah okok
[12:48:46] I don't remember where the camus whitelist is
[12:48:52] did andrew send an email or something?
[12:49:11] elukey: I don't think so
[12:49:28] elukey: I think camus_checker uses the same camus.properties file as webrequest
[12:49:41] elukey: as camus, sorry
[12:51:02] I think I found it
[12:51:24] elukey: Am I right, is it the same as core-camus?
[12:51:30] joal: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/460702
[12:53:02] elukey: I think we should actually update the global whitelist (not only the check one)
[12:53:17] elukey: must be in the template file?
[12:53:29] elukey: cause we don't want to check nor import anymore
[12:54:33] sure, but is it in puppet?
[12:54:43] I can see this
[12:54:46] # Import webrequest_* topics into /wmf/data/raw/webrequest
[12:54:46] # every 10 minutes, check runs and flag fully imported hours.
[12:54:46] camus::job { 'webrequest':
[12:54:46]     check                 => $monitoring_enabled,
[12:54:48]     minute                => '*/10',
[12:54:50]     kafka_brokers         => $kafka_brokers_jumbo,
[12:54:53]     check_topic_whitelist => 'webrequest_(upload|text)',
[12:54:56] }
[12:55:29] elukey: puppet/modules/camus/templates/webrequest.erb
[12:55:43] yep I am checking it
[12:56:07] elukey: l73 :)
[12:56:33] sending the CR now
[12:57:20] joal: can you re-check the code review?
[12:57:30] looking elukey
[12:58:18] elukey: looks good except for one question - Does kafka-checker accept the whitelist as a regex?
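The whitelist in the CR is an ordinary regex, so its effect on the topic list is easy to illustrate. The sketch below mirrors regex whitelist matching in general (CamusPartitionChecker itself is JVM code; this is not its implementation):

```python
import re

# The whitelist from the puppet change: after dropping misc, only the
# text and upload webrequest topics should be checked/imported.
whitelist = re.compile(r"webrequest_(upload|text)")

topics = ["webrequest_text", "webrequest_upload", "webrequest_misc"]
checked = [t for t in topics if whitelist.match(t)]
print(checked)  # ['webrequest_text', 'webrequest_upload']
```

The alternation keeps the existing topics while webrequest_misc no longer matches, which is exactly the "don't check nor import anymore" behavior wanted above.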
[12:58:53] so I can see
[12:58:53] # [*check_topic_whitelist*]
[12:58:54] #   If given, only topics matching this regex will be checked by the CamusPartitionChecker.
[12:59:04] Ok looks good
[12:59:10] and we use regexes for other jobs so it should be gooood
[12:59:20] this is pcc https://puppet-compiler.wmflabs.org/compiler1002/12469/analytics1003.eqiad.wmnet/
[13:00:09] merging
[13:02:03] aaand done
[13:02:09] an1003 updated
[13:02:31] joal: --^
[13:04:13] elukey: \o/ !! Many thanks for that :)
[13:04:57] joal: do you have the address of the place in which we are going to stay? (if so, in pvt :)
[13:06:09] elukey: looking
[13:10:14] elukey: sending email to the internal list, and going back to normal activities :)
[13:10:19] elukey: Have a safe flight !!
[13:12:55] joal: thanks a ton for the deploy! You are always the best
[13:13:44] Thanks a lot for popping up from the airport :)
[13:13:46] all right, in ~1h I should leave
[13:13:49] all good
[13:17:49] RECOVERY - Check the last execution of check_webrequest_partitions on analytics1003 is OK: OK: Status of the systemd unit check_webrequest_partitions
[13:19:22] \o/ --^