Extract Links from a sitemap(xml)

Question

Let's say I have a sitemap.xml file with this data:

<url>
<loc>http://domain.com/pag1</loc>
<lastmod>2012-08-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>http://domain.com/pag2</loc>
<lastmod>2012-08-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>http://domain.com/pag3</loc>
<lastmod>2012-08-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>

I want to extract all the locations from it (data between <loc> and </loc>).

The sample output should look like:

http://domain.com/pag1
http://domain.com/pag2
http://domain.com/pag3

How can I do this?

Windows 7 Ultimate X64 / Windows 8 Pro X64 or Ubuntu 12.04 Linux. — Akshat Mittal, Aug 27 '12 at 13:13
Nice setup. Using Terminal on the Ubuntu box, [my answer below](http://superuser.com/a/466874/152250) will get you what you need. — bobmagoo, Aug 27 '12 at 13:22
You can also use any text editor like SublimeText2 which can use regexp, you can get all data with it, or you can use python see my answer below. — Ishikawa Yoshi, Aug 27 '12 at 14:35

score 9 · Answer 1 · answered Aug 27 '12 at 11:40

If you're on a Linux box or something with the grep tool, you can just run:

grep -Po 'http(s?)://[^ \"()\<>]*' sitemap.xml
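
For readers without grep (for example on the Windows machines mentioned in the question), the same pattern can be tried in Python. This is only a minimal sketch and not part of the original answer: the function name `extract_urls` and the inline sample are illustrative assumptions, and the character class is adapted from the grep command above, with "any whitespace" in place of the single space because Python scans the whole text rather than one line at a time.

```python
import re

# Match "http://" or "https://" followed by any run of characters that are
# not whitespace, a double quote, a parenthesis or an angle bracket.
URL_PATTERN = re.compile(r'https?://[^\s"()<>]*')

def extract_urls(text):
    """Return every URL-like match in text, in document order."""
    return URL_PATTERN.findall(text)

# Illustrative sample; the second URL contains "?" and "+".
sample = """<url>
<loc>http://domain.com/pag1</loc>
<lastmod>2012-08-25</lastmod>
</url>
<url>
<loc>http://domain.com/pag2?x=1+2</loc>
</url>"""

for url in extract_urls(sample):
    print(url)
```

Like the grep command, this matches any URL in the file, including a namespace URL in the root element if one is present, not only the contents of `<loc>`.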

answered Aug 27 '12 at 11:40

bobmagoo

This worked but with a lot of mistakes (Incomplete URL's). – Akshat Mittal Aug 28 '12 at 13:46
Weird, I just ran this over [Google's sitemap.xml file](http://www.google.com/sitemap.xml) and didn't see any issues. Which ones did it miss? – bobmagoo Aug 29 '12 at 17:46
This missed many url's that contained "?" and "+". – Akshat Mittal Aug 30 '12 at 09:50
Thank you. For anybody who wants to save the output to a file: `grep -Po 'http(s?)://[^ \"()\<>]*' sitemap.xml > links.txt` – trante Sep 18 '14 at 11:51
+1 This is actually a very simple but powerful solution. – ABCD May 10 '15 at 13:06

Ishikawa Yoshi · Accepted Answer · 2012-08-28T14:20:19.397

2

You can use a Python script here.

This script gets any links starting with http:

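import re

# Print every http URL that sits between ">" and "<" on a line.
# Use https?:// in the pattern to match HTTPS links as well.
with open('sitemap.xml', 'r') as f:
    for line in f:
        for url in re.findall(r'>(http://.+?)<', line):
            print(url)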

In your case, the next script finds all data wrapped in <loc> tags:

import re

# Print only the URLs wrapped in <loc>...</loc> tags.
with open('sitemap.xml', 'r') as f:
    for line in f:
        for url in re.findall(r'<loc>(http://.+?)</loc>', line):
            print(url)

Here is a nice tool to play with regular expressions if you are not familiar with them.

If you need to load a remote file, you can use the following code:

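import re
from urllib.request import urlopen  # Python 3 replacement for urllib2

# Fetch the remote sitemap and print every URL wrapped in <loc> tags.
with urlopen('http://server.com/sitemap.xml') as f:
    for raw in f:
        line = raw.decode('utf-8')  # the response yields bytes, not str
        for url in re.findall(r'<loc>(http://.+?)</loc>', line):
            print(url)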

edited Aug 28 '12 at 14:20

answered Aug 27 '12 at 12:00

Ishikawa Yoshi


How do I load a remote file like `http://server.com/sitemap.xml`? I am not very familiar with Python. – Akshat Mittal Aug 28 '12 at 14:09
You mean load it with Python? – Ishikawa Yoshi Aug 28 '12 at 14:14
Yup. You used `f = open('sitemap.xml','r')` to open the file; how do I open a remote file on an HTTP server? – Akshat Mittal Aug 28 '12 at 14:16
I updated my post; you need to use the urllib2 module. – Ishikawa Yoshi Aug 28 '12 at 14:22
Shows error `AttributeError: 'list' object has no attribute 'findall'` – Akshat Mittal Aug 28 '12 at 14:33
Did you import the re module? – Ishikawa Yoshi Aug 28 '12 at 14:37
let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/4660/discussion-between-ishikawa-yoshi-and-akshat-mittal) – Ishikawa Yoshi Aug 28 '12 at 14:38
Very good answer! A reminder that if your links are in HTTPS, change *http* to *https* in the code. – My Name May 20 '18 at 14:04

score 2 · Answer 3 · answered Aug 07 '15 at 15:55

This can be done with a single sed command, which seems to be more robust than the grep solution:

sed '/<loc>/!d; s/[[:space:]]*<loc>\(.*\)<\/loc>/\1/' inputfile > outputfile

(found at: linuxquestions.org)
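
To make the two sed steps easier to follow, here is a rough Python rendering of the same logic. It is a sketch under assumptions, not the answerer's code: the function name `loc_lines` and the sample lines are made up for illustration, and `\s` stands in for sed's `[[:space:]]`.

```python
import re

def loc_lines(lines):
    """Keep only lines containing <loc>, then strip the tags from each."""
    result = []
    for line in lines:
        if '<loc>' not in line:      # sed: /<loc>/!d  (drop lines without <loc>)
            continue
        # sed: s/[[:space:]]*<loc>\(.*\)<\/loc>/\1/  (replace the first match only)
        stripped = re.sub(r'\s*<loc>(.*)</loc>', r'\1', line.rstrip('\n'), count=1)
        result.append(stripped)
    return result

# Illustrative input: one <url> entry with indentation, as in many sitemaps.
sample = [
    "<url>\n",
    "  <loc>http://domain.com/pag1?x=1+2</loc>\n",
    "  <lastmod>2012-08-25</lastmod>\n",
    "</url>\n",
]

for url in loc_lines(sample):
    print(url)
```

Because the capture group takes everything between the tags, characters such as "?" and "+" in the URL are kept.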

answered Aug 07 '15 at 15:55

LarS

Your solution works perfectly. – Baptiste Donaux Apr 21 '16 at 19:36
tried it as sed '//!d; s/[[:space:]]*\(.*\)<\/loc>/\1/' sitemap.xml > links.txt but it outputs the same xml content. it worked with the above grep command but I am trying to figure out why it did not work – Mike Apr 26 '17 at 08:27
I think it's because you did not escape the () with \( and \). – LarS Apr 26 '17 at 21:03

Siva Charan · Answer 4 · 2012-08-27T11:45:08.680

1

Using XSLT, you can output the locations with this XPath expression:

/url/loc
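
A minimal sketch of how the XPath idea might look with Python's standard library, assuming the sitemap has the usual `urlset` root element and the sitemaps.org namespace (neither is shown in the question's excerpt). The function name `sitemap_locs` and the inline sample are illustrative. ElementTree supports only a subset of XPath, so the path is written relative to the root rather than as `/url/loc`.

```python
import xml.etree.ElementTree as ET

# Assumed namespace: the one used by standard sitemaps.
NS = {"s": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_locs(xml_text):
    """Return the text of every <loc> that is a child of a <url> element."""
    root = ET.fromstring(xml_text)
    # ".//s:url/s:loc" selects each loc child of any url element below the root.
    return [el.text.strip() for el in root.findall(".//s:url/s:loc", NS) if el.text]

# Illustrative sample with a root element and namespace declaration.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://domain.com/pag1</loc></url>
  <url><loc>http://domain.com/pag2</loc></url>
</urlset>"""

print("\n".join(sitemap_locs(sample)))
```

If the file declares no namespace, the same query without the `s:` prefixes (`.//url/loc`) should apply instead.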

edited Aug 27 '12 at 11:45

answered Aug 27 '12 at 11:39

Siva Charan

Could you maybe expand your answer and show the XSLT instructions and the XPath queries needed? – slhck Aug 27 '12 at 14:44
@slhck Exactly what I wanted to say. The answer should be more explanatory. – Akshat Mittal Aug 28 '12 at 09:28
I read a bit more about this and finally got it working. Upvoting, but it is not really a good enough answer to be chosen. – Akshat Mittal Aug 28 '12 at 13:56

score 0 · Answer 5 · answered Nov 25 '15 at 01:01

The XSLT solution:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:s="http://www.sitemaps.org/schemas/sitemap/0.9">

  <xsl:output method="text" />

  <xsl:template match="s:url">
    <xsl:value-of select="s:loc" />
    <xsl:text>
</xsl:text>
  </xsl:template>

</xsl:stylesheet>

answered Nov 25 '15 at 01:01

Jan Tomka

For years I've been using regex etc. for this, but XSLT is so cool in this case :) For complete noobs in XSLT (like me), it'd be nice to add that the only thing you have to do is save this code as stylesheet.xsl and add a row to your XML document with a link to the stylesheet. Then open your XML in a browser (it won't work when opened as a local file; you have to get it via HTTP). – Łukasz Rysiak Aug 08 '16 at 12:41

score 0 · Answer 6 · answered Jan 01 '22 at 13:44

You can open your sitemap.xml file in Notepad++.

Then in the menu Search → Replace (CTRL+H) specify:

Find what: </loc>.*?<loc>

Replace with: \r\n

Set Search mode to Regular expression

and then click the Replace All button.

Additionally, you can sort the links via the menu Edit → Line Operations → Sort Lines in Ascending Order
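
The same find-and-replace idea can be sketched in Python for readers who want to script it. The function names `split_on_loc_gaps` and `extract_sorted` and the sample text are assumptions made for illustration. The sketch needs `re.DOTALL` so that `.` can cross line breaks; Notepad++ has a corresponding ". matches newline" option, which presumably has to be enabled when the `<loc>` elements sit on separate lines. The replace also leaves the text before the first `<loc>` and after the last `</loc>` in place, so the sketch trims both.

```python
import re

def split_on_loc_gaps(xml_text):
    """Replace everything between </loc> and the next <loc> with a newline."""
    # re.DOTALL lets "." cross line breaks, since the gap between two loc
    # elements normally spans several lines.
    return re.sub(r"</loc>.*?<loc>", "\n", xml_text, flags=re.DOTALL)

def extract_sorted(xml_text):
    """Return the loc values sorted in ascending order."""
    body = split_on_loc_gaps(xml_text)
    # Trim the leftovers before the first <loc> and after the last </loc>.
    body = body.split("<loc>", 1)[-1].split("</loc>", 1)[0]
    return sorted(body.split("\n"))

# Illustrative sample, deliberately out of order to show the sort step.
sample = (
    "<url>\n<loc>http://domain.com/pag2</loc>\n<priority>0.9</priority>\n</url>\n"
    "<url>\n<loc>http://domain.com/pag1</loc>\n<priority>0.9</priority>\n</url>\n"
)

for url in extract_sorted(sample):
    print(url)
```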
