
My use case: I need to parse text from Wikipedia articles. There is a dump available at https://dumps.wikimedia.org/enwiki/20221001/ that contains the files I want. Essentially, the articles are broken up into pairs of compressed files: an XML document containing a subset of Wikipedia articles, and a text file containing an index into that XML document. Typically the XML documents run about 200 MB compressed, and the index files run under 1 MB compressed.

For example, here's a pair of files on the dump page referenced above:

enwiki-20221001-pages-articles-multistream1.xml-p1p41242.bz2 251.7 MB

enwiki-20221001-pages-articles-multistream-index1.txt-p1p41242.bz2 221 KB
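For reference, the index file in each pair maps every article title to a byte offset into its companion XML file. A minimal Python sketch of reading one (the offset:page_id:title line layout is my assumption based on the multistream index format; titles may themselves contain colons):

    # Read the decompressed index file; each line is assumed to be
    # "offset:page_id:title", where offset is the byte position of the
    # bz2 block inside the companion .xml.bz2 that contains the page.
    with open("enwiki-20221001-pages-articles-multistream-index1.txt-p1p41242",
              encoding="utf-8") as index:
        for line in index:
            offset, page_id, title = line.rstrip("\n").split(":", 2)
            print(offset, page_id, title)
            break  # first entry only, just to check the format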

Using WinZip (trial version) I am able to extract the text files. However, when I try to extract the XML document from the articles file, WinZip reports that the file is corrupt and offers to save what it was able to extract. Regardless of which compressed XML file I try, it always saves roughly the same amount: approximately 3 KB.

I thought the problem might be the file size, so as a test I compressed a 4 GB file of my own and extracted it again, and that worked.

I'm not sure where to go with this.

Len White
  • Try downloading the file again. If the same problem occurs, try unzipping with another program. Products I like are Bandizip and 7-Zip. – harrymc Oct 17 '22 at 09:35
  • Thank you very much! I downloaded 7-Zip and it worked! If you post your comment as an answer, I'll accept it. – Len White Oct 17 '22 at 14:01

1 Answer


Try downloading the file again.

If the same problem occurs, try unzipping with another program.

Example products: 7-Zip and Bandizip.
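The constant ~3 KB output suggests a likely cause: the "multistream" dumps are many bz2 streams concatenated into one file, and an extractor that stops after the first stream saves only the small XML header. If you would rather script the extraction, Python's bz2 module (3.3 and later) reads all of the concatenated streams; a minimal sketch, using the filename from the question:

    import bz2
    import shutil

    src = "enwiki-20221001-pages-articles-multistream1.xml-p1p41242.bz2"
    dst = "enwiki-20221001-pages-articles-multistream1.xml-p1p41242"

    # bz2.open transparently decompresses every concatenated stream,
    # not just the first, so the full XML document is recovered.
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, length=1 << 20)  # copy in 1 MiB chunks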

harrymc