
I'm trying to extract information from a JSON file and write some of the contents to a CSV file.

Here is an example of my text:

"data":{"headers":{"sender":"frank@abc.com"
"to":"jim@def.com"
"subject":"Help with this project"
"x-received-time":"14144273245408"
"received":"from abc.com ()\r\n        by mail.mail.com with SMTP (Postfix)\r\n        for jim@def.com;\r\n        Mon
"from":"\"Help with this project\" <frank@abc.com>"
"date":"Mon, 27 Oct 2014 09:03:14 -0500"
"id":"1414427328-2345855-frank"
"to":"jim@def.com"
"time":14144273245408
"subject":"Help with this project"
"fromfull":"frank@abc.com"

I want to grab the contents of to, fromfull, id, subject, and date, and write them to a CSV file where to is column A, fromfull is column B, and so forth.

Can anyone offer any assistance? This is a JSON response.

Sylvain Pineau
BeMy Friend
  • Your content looks like JSON. Consider looking for ways to convert JSON to CSV. – muru Oct 27 '14 at 18:05
  • Will the data always be laid out the way you've shown it? Or did you simply do that for clarity purposes? – Chuck R Oct 27 '14 at 18:21
  • This is always the layout. There is more information within the file than just this, but it will always be laid out pretty much identical to this. Prior to this layout, I ran a SED to break everything out into multiple lines. So if I extract the data after that SED, then the layout should always look like this. – BeMy Friend Oct 27 '14 at 18:26
  • try piping your output through json_pp, less work. – ppetraki Oct 27 '14 at 19:26
  • 3
    Check out [jq](http://stedolan.github.io/jq/). This will be as simple as something like: `jq '. | @csv' myfile.json`. – David Oct 27 '14 at 22:42
  • All the answers are missing on the fact that JSON is unordered, so it's perfectly valid to have `[{"name":"foo", age:21}, {age:23, "name":"bar"}]` as JSON for instance. Except the Node one that is. – Benjamin Gruenbaum Oct 28 '14 at 10:32
  • I added a complete answer for creating the CSV using `jq`, as mentioned by @David . My solution also deals with the key ordering issue raised by @BenjaminGruenbaum . – Joe Harris Nov 11 '14 at 18:49

6 Answers


You can convert this JSON to CSV in a single line with jq.

jq '.data.headers | [.sender, .to, .subject, ."x-received-time", 
.received, .from, .date, .id, .to, .subject, .fromfull] 
+ [(.time | tostring)] | join(", ")'

Breakdown:

  • .data.headers - Emit headers as an object
    • If data contained an array of headers it would be .data[].headers
  • […string keys list…] - Emit string values as an array
  • + [(.time | tostring)] - Emit time as a string and add to the array
  • join(", ") - Join the array values using a comma and a space
    • Substitute your favorite delimiter here

Update 2022:

jq supports the @csv (comma-separated values) and @tsv (tab-separated values) formatters. With the -r flag for raw output, the same filter can be written as:

jq -r '.data.headers | [.sender, .to, .subject, ."x-received-time", 
.received, .from, .date, .id, .to, .subject, .fromfull] 
+ [(.time | tostring)] | @csv'
Abdul Munim
Joe Harris
  • You have to pipe the output through `sed 's/"//g'` to get rid of the quotes at the end – rubo77 Nov 19 '16 at 03:00
  • 2
    Nice tool to develop this: https://jqplay.org – rubo77 Nov 19 '16 at 08:05
  • jq also has a dedicated @csv filter, very nicely described here: https://stackoverflow.com/questions/32960857/how-to-convert-arbirtrary-simple-json-to-csv-using-jq – tjb Dec 17 '18 at 09:34
  • Actually, without the `-r` parameter, jq will create a single cell with these values. You want to use `jq -r ... > data.csv` to output comma-separated values correctly. This way you can also handle newline characters `\n` – Sebastien H. Jul 31 '20 at 12:11

You can use the following Perl command to create the CSV output; open a terminal and type:

perl -n0e '@a= $_ =~ /"date":(".*?").*?"id":(".*?").*?"to":"(.*?)".*?".*?"subject":(".*?").*?"fromfull":"(.*?)"/gs;  while (my @next_n = splice @a, 0, 5) { print join(q{,}, @next_n)."\n"}' inputfile.txt

It will work even if you have multiple headers in your input file.

Note that only the last "to": field is taken into account (it seems that your headers provide that info twice).

The command output:

"Mon, 27 Oct 2014 09:03:14 -0500","1414427328-2345855-frank",jim@def.com,"Help with this project",frank@abc.com
Sylvain Pineau
  • This is what I'm looking for, but I need to add a few more fields as what you posted accounts for To and FromFull, but not ID, Subject, or Date. It looks like you are adding \"to\":\"(.*?)\".*? to indicate a new structure, but I can't pick out the exact syntax to add additional fields. THANK YOU so much, as I don't know PERL much at all. – BeMy Friend Oct 27 '14 at 18:43
  • As indicated in the original post: to, fromfull, id, subject, date – BeMy Friend Oct 27 '14 at 19:05
  • I basically just need to create a spreadsheet with basic topical information from emails. – BeMy Friend Oct 27 '14 at 19:06
  • @BeMyFriend: Sorry I missed your edit, please check the new command – Sylvain Pineau Oct 27 '14 at 19:13

Since you are working with JSON data, why not parse it as such? Install nodejs-legacy and create a NodeJS script such as:

#!/usr/bin/env node
// parseline.js process lines one by one
'use strict';
var readline = require('readline');
var rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout,
  terminal: false
});

rl.on('line', function(line){
    var obj = JSON.parse(line);
    // add the fields which you want to extract here:
    var fields = [
        obj.data.headers.to,
        obj.data.headers.subject,
        // etc.
    ];
    // print the fields, joined by a comma (CSV, duh.)
    // No escaping is done, so if the subject contains ',',
    // then you need additional post-processing.
    console.log(fields.join(','));
});

Assuming you have a valid JSON string on each line of a file:

node parseline.js < some.txt

Or if you really want to read a single file and parse fields from that:

#!/usr/bin/env node
// parsefile.js - fully read file and parse some data out of it
'use strict';
var filename = process.argv[2]; // first command-line argument
var fs = require('fs');
var text = fs.readFileSync(filename).toString();
var obj = JSON.parse(text);
// add the fields which you want to extract here:
var fields = [
    obj.data.headers.to,
    obj.data.headers.subject,
    // etc.
];
// print the fields, joined by a comma (CSV, duh.)
// No escaping is done, so if the subject contains ',',
// then you need additional post-processing.
console.log(fields.join(','));

Then run it with:

node parsefile.js yourfile.json > yourfile.csv
terdon
Lekensteyn
  • If you have a shabang in place can't you just `./parseline.js < some.txt`? – Benjamin Gruenbaum Oct 28 '14 at 10:33
  • @BenjaminGruenbaum Yes you can, but then you first need to `chmod +x parseline.js` first. – Lekensteyn Oct 28 '14 at 15:04
  • To be fair, I was pointing that out as a question out of politeness. I thought it could be a good edit for your question, but you might disagree; good answer nonetheless. – Benjamin Gruenbaum Oct 28 '14 at 15:45
  • @BenjaminGruenbaum I deliberately omitted it to avoid explaining the need to add the executable bit (which is not related to the post) and to clarify which program is responsible for it. You can of course make it executable (recommended to save typing!) or put it in `$PATH` as needed. – Lekensteyn Oct 28 '14 at 15:50

You can use jsonv from GitHub.

Then run the following command:

cat YOUR_JSON_FILEname | jsonv to,fromfull,id,subject,date > output.csv
Edward Moffett

Here is an awk implementation:

   awk -F ":" '{gsub("\"","",$1);key=$1;sub(key " ","");gsub("\\","",$0);value[key]=$0; if ("fromfull"== key) print value["from"] ";" value["to"] ";" value["fromfull"] ";" value["id"] ";" value["subject"] ";" value["date"] ;}' jsonFile > csvFile

This script reads lines until it finds the "fromfull" line, then prints a CSV line, so it should also work with multiple header sequences.

This is the result:

  ""Help with this project" <frank@abc.com>";"jim@def.com";"frank@abc.com";"1414427328-2345855-frank";"Help with this project";"Mon, 27 Oct 2014 09 03 14 -0500"
Lety
  • This was really close and worked for the most part. Just the perl implementation came out slightly better formatted. Thanks though! – BeMy Friend Oct 27 '14 at 19:18
  • You are welcome. I updated my answer, but @SilvainPineau's solution is better, also because awk's behaviour on `\\\` is strange due to this [bug](https://bugs.launchpad.net/ubuntu/+source/mawk/+bug/102208). – Lety Oct 27 '14 at 22:20

Here's a gawk script I just whipped up for you!

#!/usr/bin/gawk -f
BEGIN {
  FS="\""
  output=""
  nodata=1
}

/^"data"/{
  if( ! nodata )
  {
    gsub("|$","",output)
    print output
    nodata=0
  }
  output=""
}

/^"[^d][^a][^t][^a]/{
  if ( $2 == "to" || $2 == "fromfull" || $2 == "id" || $2 == "subject" || $2 == "date" )
    output=output$4"|"
}

END{
  sub(/\|$/,"",output)
  print output
}

It should work on a file with a bunch of entries like this one. If you want to add other items to the list, just add them in the if statement. I did find one problem with your data set, though: the dates contain commas, so the output can't be a true CSV. Instead, I separated the fields with another character (a pipe).
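For what it's worth, commas inside values don't have to rule out real CSV: a writer that quotes fields keeps them intact. A small Python sketch using the values from this data set:

```python
import csv
import sys

# Fields containing the delimiter are wrapped in quotes, so the date survives.
row = ["jim@def.com", "frank@abc.com", "1414427328-2345855-frank",
       "Help with this project", "Mon, 27 Oct 2014 09:03:14 -0500"]
csv.writer(sys.stdout).writerow(row)
# → jim@def.com,frank@abc.com,1414427328-2345855-frank,Help with this project,"Mon, 27 Oct 2014 09:03:14 -0500"
```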

Chuck R