3

I have a JSON file in which I need to remove only the trailing forward slash from each URL. See the example:

{"url":"http://example.com/vary/file/","originalUrl":"http://example.com/vary/file/","applications":[{.........}]}

I just want the data to look like:

{"url":"example.com/vary/file","originalUrl":"example.com/vary/file","applications":[{.........}]}

How can I do this with sed?

Sergiy Kolodyazhnyy
Jaffer Wilson
  • [Backslash or a (forward) slash?](http://www.cs.ucsb.edu/~pconrad/topics/BackslashVsForwardSlash/) – Melebius Feb 07 '17 at 10:07
  • @Melebius :P it should be forward, I guess. – Jaffer Wilson Feb 07 '17 at 10:11
  • How big is the file, and can a double slash blindly be replaced by a single one? – Jacob Vlijm Feb 07 '17 at 11:52
  • @JacobVlijm might happen... the file is large, in the GBs – Jaffer Wilson Feb 07 '17 at 11:59
  • @JacobVlijm nothing as such. I was just trying to use `sed`, nothing else. – Jaffer Wilson Feb 07 '17 at 12:11
  • Your output doesn't have the `http://` from the input. – muru Feb 08 '17 at 04:44
  • @muru oops, I forgot to add it. But is it possible to have it without http as well? – Jaffer Wilson Feb 08 '17 at 05:18
  • Removing `http://` should be easy, just another `sed 's,http://,,g'` – muru Feb 08 '17 at 05:30
  • @muru Yes, I think that's easy. I think that's why I didn't add it to the question previously :P :) – Jaffer Wilson Feb 08 '17 at 05:31
  • If I may voice my opinion, @JafferWilson, consider learning Python or Perl and their respective `json` APIs (and maybe learning JSON itself in the process). I understand it's sometimes fun and simple to use `sed` or other tools, but `json` APIs were created specifically for this purpose, and of course for structuring `json` data properly. I'm not trying to tell you what to do, but seriously, you can save yourself a lot of time if you start using the proper tools for the job. – Sergiy Kolodyazhnyy Feb 08 '17 at 05:32
  • Also, can you please edit your question to say that you want the `http://` part removed? Without that, Zanna's answer and mine are effectively only half-correct. What about `https://`? Do you want those removed as well? – Sergiy Kolodyazhnyy Feb 08 '17 at 05:34
  • I appreciate @Serg's suggestion. I really want to learn Python at least. But the thing is that my workload is currently extremely high and I can hardly find time to eat or sleep. But I can assure you that I will learn. – Jaffer Wilson Feb 08 '17 at 05:40
  • @Serg I can do that with a simple `sed` command from the command line. – Jaffer Wilson Feb 08 '17 at 05:41
  • @JafferWilson well, I've included removal of http:// in my answer. Do you want me to roll back to the original one, with only the slash removal, or leave it as is? – Sergiy Kolodyazhnyy Feb 08 '17 at 05:48
  • @Serg no, leave it. – Jaffer Wilson Feb 08 '17 at 06:08

4 Answers

6

If you insist on using sed, you could simply match the /" combination to remove the last / in every field, assuming it does not occur anywhere you want to keep it (which should be fairly reliable in this case):

$ sed 's|/"|"|g' file
{"url":"http://example.com/vary/file","originalUrl":"http://example.com/vary/file","applications":[{.........}]}

I used | as the delimiter instead of / to save a backslash. You need the g flag to replace multiple matches on the same line.
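
For readers more comfortable with Python, the same substitution can be sketched as below. This is only an illustration of what the sed expression matches, not part of the original answer; the function name `drop_slash_before_quote` and the `"path"` sample are made up for the example. The second call shows the caveat mentioned above: any / directly before a quote is removed, URL or not.

```python
def drop_slash_before_quote(line):
    """Remove every "/" that is immediately followed by a double quote."""
    return line.replace('/"', '"')


# The trailing slash of the URL value is removed, http:// is kept.
print(drop_slash_before_quote('{"url":"http://example.com/vary/file/"}'))

# Caveat: a "/" before a quote is removed even when the value is not a URL.
print(drop_slash_before_quote('{"path":"/tmp/"}'))
```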

Here's a way to take out the http:// as well in the same call:

$ sed -r 's|"http://([^"]+)/"|"\1"|g' file
{"url":"example.com/vary/file","originalUrl":"example.com/vary/file","applications":[{.........}]}

([^"]+) matches anything between "http:// and /" that isn't a ". We capture this part with () and refer to it as \1 in the replacement.
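
To make the capture group easier to follow, here is a rough Python equivalent of the same pattern using `re.sub`. It is a sketch for illustration only; the names `PATTERN` and `strip_scheme_and_slash` are hypothetical. Note that, like the sed expression, it only touches values that both start with http:// and end with a slash.

```python
import re

# Same pattern as the sed expression: a quoted value that starts with
# http:// and ends with a slash; group 1 is everything in between.
PATTERN = re.compile(r'"http://([^"]+)/"')


def strip_scheme_and_slash(line):
    """Rewrite every "http://.../" value on the line to "..."."""
    return PATTERN.sub(r'"\1"', line)


sample = ('{"url":"http://example.com/vary/file/",'
          '"originalUrl":"http://example.com/vary/file/"}')
print(strip_scheme_and_slash(sample))

# A value without a trailing slash does not match and is left alone.
print(strip_scheme_and_slash('{"url":"http://example.com/a"}'))
```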

Zanna
6

I took the liberty of modifying OP's input slightly because, as it stands, it is not valid JSON (due to the {...} part), and implemented a small Python script that works with multiple dictionaries, assuming one dictionary per line. Additionally, as discussed in the comments on the question, OP also wanted the http:// part removed.

The script below implements everything discussed above.

#!/usr/bin/env python
import json
import sys

# Process one JSON object per line; the input file is the first argument.
with open(sys.argv[1]) as f:
    for line in f:
        data = json.loads(line)
        for key in ("url", "originalUrl"):
            # Remove http:// whether or not the URL ends with a slash.
            value = data[key].replace('http://', '')
            # Drop a single trailing slash, if present.
            if value.endswith('/'):
                value = value[:-1]
            data[key] = value
        json.dump(data, sys.stdout)
        print("")

Test run:

$ cat input.txt                                                                                 
{"url":"http://example.com/vary/file/","originalUrl":"http://example.com/vary/file/","applications":[{"somedata": "blah"}]}
{"url":"http://another-example.com/vary/file/","originalUrl":"http://example.com/vary/file/","applications":[{"somedata": "blah"}]}
$ ./remove_slash.py input.txt                                                                   
{"url": "example.com/vary/file", "applications": [{"somedata": "blah"}], "originalUrl": "example.com/vary/file"}
{"url": "another-example.com/vary/file", "applications": [{"somedata": "blah"}], "originalUrl": "example.com/vary/file"}
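
The comments on the question also raise https://. The following is a minimal sketch of how that could be handled, assuming one JSON object per line as above and assuming https:// should be dropped too (the question does not confirm this). The names `SCHEME`, `clean_url` and `clean_line` are hypothetical.

```python
import json
import re

# Assumption: both http:// and https:// prefixes should be removed.
SCHEME = re.compile(r'^https?://')


def clean_url(value):
    """Drop a leading scheme and one trailing slash from a URL string."""
    value = SCHEME.sub('', value)
    return value[:-1] if value.endswith('/') else value


def clean_line(line, keys=("url", "originalUrl")):
    """Parse one JSON object and clean the listed keys if they hold strings."""
    data = json.loads(line)
    for key in keys:
        if isinstance(data.get(key), str):
            data[key] = clean_url(data[key])
    return json.dumps(data)


print(clean_line('{"url":"https://example.com/vary/file/",'
                 '"originalUrl":"http://example.com/vary/file/"}'))
```

Checking the value type with `isinstance` means a missing key or a non-string value is skipped rather than raising an error.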
Sergiy Kolodyazhnyy
5

A late one: a simple, purely text-based Python option:

#!/usr/bin/env python3
import sys

with open(sys.argv[1]) as data:
    for l in data:
        print(("").join(l.strip().replace("http://", "").rsplit("/", 1)))

Or, just for fun, another way of saying it:

#!/usr/bin/env python3
import sys

[print(("").join(l.strip().replace("http://", "").rsplit("/", 1))) for l in open(sys.argv[1])]

This does both the string removal (http://) and the slash removal in approximately 47 seconds on 14,000,000 lines, on my ancient system.

To use:

python3 /path/to/script.py /path/to/inputfile > outputfile

Explanation

As usual, python is quite readable, but in detail:

  • rsplit("/", 1) splits the line from the right (hence the r) at the delimiter /, only once (hence the 1), so only the last / on each line is removed
  • l.replace("http://", "") replaces http:// with an empty string
  • ("").join() joins the list created by rsplit() back into a single line
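
The three steps can be traced on a shortened version of the question's sample line (the `applications` field is left out here for brevity). This is just an illustrative walk-through; the variable names are made up.

```python
line = ('{"url":"http://example.com/vary/file/",'
        '"originalUrl":"http://example.com/vary/file/"}')

# Step 1: drop every http:// on the line.
without_scheme = line.replace("http://", "")

# Step 2: split once, at the last "/" on the line.
parts = without_scheme.rsplit("/", 1)

# Step 3: glue the two pieces back together without that "/".
result = "".join(parts)

print(parts)
print(result)
```

With two URLs on one line, `parts` has two elements and only the slash closing `originalUrl` disappears; the one closing `url` is still present in `result`.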
Jacob Vlijm
0

Input JSON file (test.json):

{"url":"http://example.com/vary/file/","originalUrl":"http://example.com/vary/file/"}
  • Code to modify the values as required and write them back to the same file:

    import json

    with open("test.json") as fh:
        data = json.load(fh)

    for k, v in data.items():
        data[k] = v.replace("http://", "").strip("/")

    with open("test.json", "w") as fh:
        json.dump(data, fh)

Output:

{"url": "example.com/vary/file", "originalUrl": "example.com/vary/file"}

Both operations happen in one expression: replace() removes http:// and strip("/") removes / from both ends of the string, which here means the trailing /.

replace("http://","").strip("/")
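
If the file also contains non-string values, such as the `applications` list from the question, calling `replace()` on them would fail. A possible variation, sketched under that assumption, cleans only string values and uses `rstrip("/")` so that only the right-hand end is touched. The function name `clean_values` is hypothetical.

```python
import json


def clean_values(data):
    """Return a copy where string values lose http:// and trailing slashes."""
    cleaned = {}
    for key, value in data.items():
        if isinstance(value, str):
            # rstrip("/") only touches the right-hand end of the string;
            # strip("/") would also remove any leading slashes.
            value = value.replace("http://", "").rstrip("/")
        cleaned[key] = value
    return cleaned


record = json.loads('{"url":"http://example.com/vary/file/",'
                    '"applications":[{"somedata":"blah"}]}')
print(json.dumps(clean_values(record)))
```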
StackGuru