6

Let's say I have a sitemap.xml file with this data:

<url>
<loc>http://domain.com/pag1</loc>
<lastmod>2012-08-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>http://domain.com/pag2</loc>
<lastmod>2012-08-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>http://domain.com/pag3</loc>
<lastmod>2012-08-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>

I want to extract all the locations from it (data between <loc> and </loc>).

The expected output would be:

http://domain.com/pag1
http://domain.com/pag2
http://domain.com/pag3

How can I do this?

Akshat Mittal

6 Answers

9

If you're on a Linux box or something with the grep tool, you can just run:

grep -Po 'http(s?)://[^ \"()\<>]*' sitemap.xml
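
For readers without GNU grep (the -P flag is specific to GNU grep), the same pattern can be tried in Python. This is only a minimal sketch, assuming the sitemap text has already been read into a string; the name extract_urls and the sample data are made up for illustration.

```python
import re

# Same character class as the grep pattern: a URL runs until a space,
# a double quote, a parenthesis, or an angle bracket.
URL_PATTERN = re.compile(r'https?://[^ "()<>]*')

def extract_urls(text):
    """Return every http(s) URL found in text, in order of appearance."""
    return URL_PATTERN.findall(text)

# Hypothetical sample, shortened from the question
sample = """<url>
<loc>http://domain.com/pag1</loc>
<lastmod>2012-08-25</lastmod>
</url>
<url>
<loc>http://domain.com/pag2</loc>
</url>"""

for url in extract_urls(sample):
    print(url)
```

Like the grep command, this matches any URL in the file, so on a complete sitemap it would also pick up the namespace URL in the xmlns attribute of the root element.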

bobmagoo
2

You can use a Python script here.

This script prints any link that starts with http:

import re

# Print every http(s) URL that appears as element text (between > and <)
with open('sitemap.xml', 'r') as f:
    for line in f:
        for url in re.findall(r'>(https?://.+?)<', line):
            print(url)

In your case, the following script finds all data wrapped in <loc> tags:

import re

# Print the contents of every <loc> element
with open('sitemap.xml', 'r') as f:
    for line in f:
        for url in re.findall(r'<loc>(https?://.+?)</loc>', line):
            print(url)

Here is a nice tool to play with regular expressions if you are not familiar with them.

If you need to load a remote file, you can use the following code:

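import re
from urllib.request import urlopen

# Fetch the remote sitemap and decode the response bytes to text
with urlopen('http://server.com/sitemap.xml') as f:
    text = f.read().decode('utf-8')

# Print the contents of every <loc> element
for url in re.findall(r'<loc>(https?://.+?)</loc>', text):
    print(url)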
Ishikawa Yoshi
2

This could be accomplished by a single sed command, which seems to be more solid than the grep solution:

sed '/<loc>/!d; s/[[:space:]]*<loc>\(.*\)<\/loc>/\1/' inputfile > outputfile

(found at: linuxquestions.org)
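
To make the two steps of the sed command easier to follow, here is a rough Python equivalent. It is a sketch, not a replacement for sed; the function name loc_lines and the sample lines are made up for illustration.

```python
import re

def loc_lines(lines):
    """Mimic the sed command: drop lines without <loc>, then strip the tags."""
    for line in lines:
        if '<loc>' not in line:
            continue  # sed: /<loc>/!d  (delete lines that do not contain <loc>)
        # sed: s/[[:space:]]*<loc>\(.*\)<\/loc>/\1/  (keep only the captured text)
        yield re.sub(r'\s*<loc>(.*)</loc>', r'\1', line.rstrip('\n'), count=1)

# Hypothetical input lines, including one indented <loc> line
sample = [
    '<url>\n',
    '  <loc>http://domain.com/pag1</loc>\n',
    '<lastmod>2012-08-25</lastmod>\n',
    '</url>\n',
    '<loc>http://domain.com/pag2</loc>\n',
]

for url in loc_lines(sample):
    print(url)
```

Both versions work line by line, so they assume that each <loc> element sits on a single line, as in the question.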

LarS
  • Your solution works perfectly. – Baptiste Donaux Apr 21 '16 at 19:36
  • I tried it as sed '//!d; s/[[:space:]]*\(.*\)<\/loc>/\1/' sitemap.xml > links.txt, but it outputs the same XML content. The grep command above worked, but I am trying to figure out why this one did not. – Mike Apr 26 '17 at 08:27
  • I think it's because you did not escape the () with \( and \). – LarS Apr 26 '17 at 21:03
1

Using XSLT, you can select the locations with the XPath expression:

/url/loc
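
The same kind of path can be tried outside XSLT. The sketch below uses Python's standard xml.etree.ElementTree, which supports a limited XPath subset and only relative paths on an element, so the path is written as url/loc. The <urlset> wrapper and the sample data are assumptions, added so that the fragment from the question is well-formed.

```python
import xml.etree.ElementTree as ET

# Fragment from the question, wrapped in a root element (an assumption)
fragment = """<urlset>
<url><loc>http://domain.com/pag1</loc><lastmod>2012-08-25</lastmod></url>
<url><loc>http://domain.com/pag2</loc><lastmod>2012-08-25</lastmod></url>
</urlset>"""

root = ET.fromstring(fragment)

# Relative path from the root element: every <loc> inside a <url>
locations = [loc.text for loc in root.findall('url/loc')]
print('\n'.join(locations))
```

A real sitemap usually declares a default namespace on its root element; in that case an unprefixed path like this one matches nothing, and the path has to be namespace-qualified.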
Siva Charan
0

The XSLT solution:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:s="http://www.sitemaps.org/schemas/sitemap/0.9">

  <xsl:output method="text" />

  <xsl:template match="s:url">
    <xsl:value-of select="s:loc" />
    <xsl:text>
</xsl:text>
  </xsl:template>

</xsl:stylesheet>
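
The stylesheet binds the prefix s to the sitemap namespace so that s:url and s:loc match the elements of a real sitemap. The same prefix mapping can be sketched with Python's standard xml.etree.ElementTree. Note that this swaps in a different technique and is not XSLT itself; the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Same prefix-to-namespace binding as xmlns:s in the stylesheet
NS = {'s': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

# Hypothetical sitemap with the default sitemap namespace declared
sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://domain.com/pag1</loc></url>
  <url><loc>http://domain.com/pag2</loc></url>
</urlset>"""

root = ET.fromstring(sitemap)

# Counterpart of match="s:url" followed by select="s:loc"
locs = [loc.text for loc in root.findall('s:url/s:loc', NS)]
print('\n'.join(locs))
```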
Jan Tomka
  • For years I've been using regex etc. for this, but XSLT is so cool in this case :) For complete noobs in XSLT (like me), it'd be nice to add that the only thing you have to do is save this code as stylesheet.xsl and add a row to your XML document with a link to the stylesheet. Then open your XML in a browser (it won't work when opening it as a local file; you have to get it via HTTP). – Łukasz Rysiak Aug 08 '16 at 12:41
0

You can open your sitemap.xml file in Notepad++.

Then in the menu Search → Replace (CTRL+H) specify:

Find what: </loc>.*?<loc>

Replace with: \r\n

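Set Search mode to Regular expression and check ". matches newline" (the text between </loc> and the next <loc> spans several lines),

and then click the Replace All button. The markup before the first <loc> and after the last </loc> is left in place and has to be deleted by hand.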

Additionally, you can sort the links via the menu Edit → Line Operations → Sort Lines in Ascending Order.
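
For anyone who wants to see what the replacement does before running it on a large file, the steps can be sketched in Python. This is an illustration only: re.DOTALL stands in for the ". matches newline" option in Notepad++, and the function name replace_all and the sample text are made up.

```python
import re

def replace_all(text):
    """Same find/replace as above: join neighbouring <loc> values with a newline."""
    return re.sub(r'</loc>.*?<loc>', '\n', text, flags=re.DOTALL)

# Hypothetical sample with the links in reverse order
sample = """<url>
<loc>http://domain.com/pag2</loc>
<lastmod>2012-08-25</lastmod>
</url>
<url>
<loc>http://domain.com/pag1</loc>
</url>"""

joined = replace_all(sample)

# One <loc> before the first link and one </loc> after the last link remain,
# together with the surrounding markup; cut everything outside them.
start = joined.index('<loc>') + len('<loc>')
end = joined.rindex('</loc>')

# Counterpart of the ascending line sort
links = sorted(joined[start:end].split('\n'))
print('\n'.join(links))
```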

Stano