0

How can I extract two or three "parallel" XPaths on a web page to/in a format that can be imported into a spreadsheet?

Here is an example: on https://www.amazon.co.uk/s?rh=n%3A20606777031&language=en_GB&brr=1&pf_rd_i=3012216031&pf_rd_m=A3P5ROKL5A1OLE&pf_rd_p=7accc6fa-9287-4d54-804e-88e159c5cb45&pf_rd_r=P0TZ793D83HJK8616YKC&pf_rd_s=merchandised-search-top-2&pf_rd_t=101&rd=1&ref=uk_outsbcd_1 most items have an actual price and an RRP. I would like to extract all prices as well as RRPs to something that can be used in a spreadsheet.

An ugly XPath to get the RRP is

/html/body/div[1]/div[2]/div[1]/div[1]/div/span[1]/div[1]/div[11]/div/div/div/div/div[2]/div[3]/div/a/div/span[2]/span[2]/text()

and to get the (non-decimal part of the) price

/html/body/div[1]/div[2]/div[1]/div[1]/div/span[1]/div[1]/div/div/div/div/div/div[2]/div[3]/div/a/span/span[2]/span[2]/text()

The product name:

/html/body/div[1]/div[2]/div[1]/div[1]/div/span[1]/div[1]/div/div/div/div/div/div[2]/div[1]/h2/a/span

If I execute these XPaths with $x, expand the result, right click on the result here (this is a screenshot from the console in Firefox):

screenshot of the console in Firefox

and select Copy message (in Firefox) I get something that looks like this:

Array(39) [ #text, #text, #text, #text, #text, #text, #text, #text, #text, #text, … ]
​
0: #text "2"
​
1: #text "7"
​
2: #text "8"

when pasted into a text editor.

I would like the output in a format that required less postprocessing before I can insert it into a spreadsheet. Furthermore, if I get many (100+? 1000+?) results Firefox doesn't list them all.

As a first step I would like

Array(39) [ #text, #text, #text, #text, #text, #text, #text, #text, #text, #text, … ]
​
0: #text "2"
​
1: #text "7"
​
2: #text "8"

to look like this:

2
7
8

That is, just the numbers/result, one on each row (which I then can paste into a spreadsheet, one column for each XPath, and the name will match with the price and the RRP).

The next step would be to make a csv/tsv with everything included, something like:

prod1,2,7
prod2,7,11
prod3,8,14

(I prefer tab to comma, but I used comma here because it made it clearer)

Which in a spreadsheet would come out like this:

name price RRP
prod1 2 7
prod2 7 11
prod3 8 14

How can I do this? I imagine a Javascript solution that I execute in the browser console where the input is one¹ or more XPaths and that the output is a string that I can either copy from the console and paste into a spreadsheet (or maybe that string is inserted directly in the pasteboard?)

Since this is a hack that I will only use once at a specific site, I am prepared to do some manual cleanup or manipulation afterwards.

I suck at using the browser console as well as Javascript...That is why I ask this question.

(My final goal is to calculate the price change in percent)

¹ In the case of one XPath as input, I will execute this several times, and copy-paste into a spreadsheet after each execution. In the case of two or more XPaths as input, I would just execute it once.

d-b
  • 636
  • 1
  • 6
  • 24

0 Answers0