Searching for specialized patterns using grep in a JSON file

Question

How can I grep only the "created_at": lines that come directly after a line containing }, as in the sample below?

        "hashtags": [],
        "urls": []
    },
    "created_at": "Wed Oct 19 22:19:42 +0000 2016",
    "retweeted": false,
    "coordinates": null,
    "in_reply_to_user_id_str": null,
    "source": "<a href=\"http://tweetlogix.com\" rel=\"nofollow\">Tweetlogix</a>",
    "in_reply_to_status_id_str": null,
    "in_reply_to_screen_name": null,
    "in_reply_to_user_id": null,
    "place": null,
    "retweet_count": 0,
    "id_str": "788867246953201664"
},
{
    "favorited": false,
    "contributors": null,
    "truncated": false,
    "text": "Reddit Exposes Hillary Clinton Staff Trying To Frame Assange As \u2018Pedo\u2019 https://t.co/KNj14p8QqN via @yournewswire",
    "possibly_sensitive": false,
    "is_quote_status": false,
    "in_reply_to_status_id": null,
    "user": {
        "follow_request_sent": false,
        "has_extended_profile": false,
        "profile_use_background_image": true,
        "time_zone": "Eastern Time (US & Canada)",

Initially, I was using grep -wirnE 'Wed Oct 19 2(1:[0-5][0-9]:[0-5][0-9]|2:([0-2][0-9]:[0-5][0-9]|30:00)) .* 2016' * > results_created_at and then wc -l results_created_at to count the tweets that were created in that specific time range. However, it turns out that profile images or users can also have been created in that time range, so they are counted as well. How can I restrict my initial grep command so that it only matches the tweets themselves?

I have looked at many of the tweets in my files, and in all of them the tweet's "created_at": comes directly after a line containing }, (followed by a newline), with the text a few lines further down.

Maybe you should use `jq` (package `jq`) instead of `grep`. See https://stedolan.github.io/jq/manual/ — Florian Diesch, Jan 15 '18 at 21:26
If they're all indented the same, you could maybe match the amount of whitespace at the head of the line. — wjandrea, Jan 15 '18 at 23:40

Dude Random21 · Answer 1 · 2018-01-16T16:48:01.523

Adding -z to your grep options makes grep treat the input as records terminated by a NUL character (\0) instead of a newline, so a file without NUL characters is read as one long record and a pattern can span several lines. The newlines themselves do not seem to be matchable with \n in the regex, however. The workaround is to simply match everything (.*) between }, and the end of your desired pattern (in your case "created_at").

Next you can add -o to have grep output only what is actually matched; otherwise it outputs the whole file (since it is now essentially one giant line). Alternatively, if the only purpose of writing to a file is to run wc -l on it later, I would suggest grep's -c option, which prints a count instead of the matches themselves. Be aware that -c counts matching records rather than individual matches, and with -z each file is a single record.

This translates to the following command:

grep -wirnEzc '},.*created_at' *
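
As a sanity check on what the count from -zc means, here is a small experiment, assuming GNU grep. The three-line sample input and the variable names are made up for illustration.

```shell
# Three input lines, each containing "a".
# Without -z, -c counts matching lines.
lines_count=$(printf 'a\na\na\n' | grep -c 'a')

# With -z, the whole input (no NUL bytes) is a single record,
# so -c can report at most one matching record.
records_count=$(printf 'a\na\na\n' | grep -zc 'a')

echo "matching lines: $lines_count, matching records: $records_count"
```

So with -z the count is best read as "does this file contain the pattern at all", not as the number of occurrences in it.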

Expanding on this to include your previous pattern as well we get:

grep -wirnEzc '},.*created_at":\s"Wed Oct 19 2(1:[0-5][0-9]:[0-5][0-9]|2:([0-2][0-9]:[0-5][0-9]|30:00)) .* 2016' *
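
If a per-tweet count is needed, one alternative is to go through the file line by line and count a "created_at" line only when the line directly before it consists of }, and nothing else. The following is a minimal sketch with awk rather than grep. It assumes the files are pretty-printed with one key per line, as in the question's sample. The function name count_tweet_timestamps is hypothetical, and the time range is the one from the question.

```shell
# Count "created_at" lines in the wanted time range that directly follow
# a line containing only "}," (ignoring leading/trailing blanks).
# Assumption: pretty-printed JSON, one key per line.
# count_tweet_timestamps is a made-up helper name, not a standard tool.
count_tweet_timestamps() {
    awk '
        # Only count when the previous line was a lone "},".
        prev_closes && /"created_at": "Wed Oct 19 2(1:[0-5][0-9]:[0-5][0-9]|2:([0-2][0-9]:[0-5][0-9]|30:00)) .* 2016"/ { n++ }

        # Remember whether the current line is a lone "}," for the next line.
        { prev_closes = ($0 ~ /^[ \t]*[}],[ \t]*$/) }

        # n + 0 prints 0 instead of an empty string when nothing matched.
        END { print n + 0 }
    ' "$@"
}

# Usage (file names are placeholders):
#   count_tweet_timestamps tweets1.json tweets2.json
```

This prints a single total over all files given. A nested "created_at" that follows an opening line such as "user": { is skipped, because that line is not a lone },.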

My apologies, I thought you were looking for a replacement for your original pattern. I have updated my answer to include your original pattern. — Dude Random21, Jan 16 '18 at 16:49
