I have a list of urls like:

hxxp://url.com/subpage.html
hxxp://www.url2.com/index.php
hxxp://subdomain.url3.com/somepage.php
...

How can I use grep to match only the domain names?

Every URL has a / after the domain. There are many different TLDs (I'm not sure how many), and the list is quite big.

hillacma

3 Answers

To use non-greedy regexes with grep you need the -P option (Perl-compatible regular expressions). The -o option outputs only the matching portion. You also need \K and a lookahead so that the parts of the pattern before and after the domain are not included in the output.

grep -Po '.*?//\K.*?(?=/)'

Example:

$ echo 'hxxp://subdomain.url3.com/somepage.php' | grep -Po '.*?//\K.*?(?=/)'
subdomain.url3.com
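
To run the same expression over a whole list rather than a single echo, something like the following may work. This is a minimal sketch, assuming a grep build with -P support (GNU grep with PCRE); the function name extract_domains and the file name urls.txt are hypothetical, and sort -u is added only to collapse repeated hosts.

```shell
# Hypothetical helper: read URLs on stdin, print each host once.
# Assumes grep supports -P (Perl-compatible regexes) and -o.
extract_domains() {
  # \K drops everything matched so far (scheme and "//") from the output;
  # (?=/) requires a following "/" without including it in the match.
  grep -Po '.*?//\K.*?(?=/)' | sort -u
}

# Usage (urls.txt is an assumed file with one URL per line):
#   extract_domains < urls.txt
```

Lines that contain no "//" followed later by a "/" are skipped silently, which matches the stated assumption that every URL has a / after the domain.
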
Dennis Williamson

There is a great place to test your regex skills here. The expression should look like

.*?//(.*?)/

You will need to loop through all the results. On the page I linked, you can enter this expression together with a web address, and it will show you what matched. Also remember that the captured group will only be available for a limited time.
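
grep -o prints the whole match and cannot print a capture group on its own, so a capture-based expression like this one is easier to apply with sed. The following is a rough sketch under that assumption, not the answerer's method: the function name capture_domains is hypothetical, and [^/]* stands in for the non-greedy quantifier because sed's basic regex syntax has no lazy matching.

```shell
# Hypothetical helper: print the text captured between "//" and the next "/".
capture_domains() {
  # -n suppresses default output; the trailing p prints only lines where
  # the substitution succeeded, so non-URL lines are dropped.
  sed -n 's|^[^/]*//\([^/]*\)/.*|\1|p'
}

# Usage (urls.txt is an assumed file with one URL per line):
#   capture_domains < urls.txt
```

sed processes the input line by line, so no explicit loop is needed in the shell.
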

Robert Leckie

If each URL has only a single-label TLD after the domain (such as .com, not .co.uk), then this should work (I'm assuming you want to exclude the subdomain):

[^\./]*\.[^\./]*/

The match still includes the trailing slash, but you can strip it by piping the output through sed.
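
Put together, the grep and sed steps might look like the sketch below. It assumes a grep with the -o option and that no path segment contains a dot followed later by a slash (such a segment would produce an extra match); the function name registered_domains is hypothetical. Inside a bracket expression the dot needs no escaping, so the class is written as [^./].

```shell
# Hypothetical helper: print "domain.tld" for each URL on stdin.
registered_domains() {
  # grep -o emits e.g. "url3.com/"; sed then removes the trailing slash.
  grep -o '[^./]*\.[^./]*/' | sed 's|/$||'
}

# Usage (urls.txt is an assumed file with one URL per line):
#   registered_domains < urls.txt
```
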

Hydaral