24

I have a problem with viewing chunks of a very large text file. This file, approximately 19 GB, is obviously too big to view by any traditional means.

I have tried head 1 and tail 1 (head -n 1 and tail -n 1) with both commands piped together in various ways (to get at a piece in the middle) with no luck. My Linux machine running Ubuntu 9.10 cannot process this file.

How do I handle this file? My ultimate goal is to home in on lines 45000000 and 45000100.

Oliver Salzburg
nicorellius
  • Thinking of writing a quick Python script to read the lines and print the ones I need to file (a minimal sketch of that idea follows these comments), but I can imagine this taking a long time... – nicorellius Feb 21 '12 at 00:04
  • Are all the lines the same length? – Paul Feb 21 '12 at 00:10
  • @Paul - unfortunately, they are not the same length. – nicorellius Feb 21 '12 at 00:15
  • You can try [`split`](http://linux.die.net/man/1/split) to make the large file easier to work with. – iglvzx Feb 21 '12 at 00:24
  • Ok. Any processing of a file that large will take time, so the answers below will help with that. If you want to extract just the part you are looking for and can estimate approximately where it is, you can use `dd` to get the bit you are after. For example, `dd if=bigfile of=extractfile bs=1M skip=10240 count=5` will extract 5MB from the file starting at the 10GB point. – Paul Feb 21 '12 at 01:38
  • Yes, I agree with you Paul. I wrote a Python script and it definitely took forever to process the file. I have the `sed` job running now and I imagine it will take quite a while to complete. But testing with the beginning of the file appears promising. Thanks. – nicorellius Feb 21 '12 at 07:12
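A minimal sketch of that quick Python script idea might look like the following; the file names bigfile and savedlines are just placeholders, borrowed from the sed answer further down.

# Minimal sketch: stream the file once and keep only the requested range of lines.
# Reading and writing in binary avoids any decoding surprises in a 19 GB file.
start, end = 45000000, 45000100

with open('bigfile', 'rb') as infile, open('savedlines', 'wb') as outfile:
    for lineno, line in enumerate(infile, start=1):
        if lineno > end:
            break                     # stop early; no need to read the rest of the file
        if lineno >= start:
            outfile.write(line)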

4 Answers

17

You should use sed.

sed -n -e 45000000,45000100p -e 45000101q bigfile > savedlines

This tells sed to print lines 45000000-45000100 inclusive, and to quit on line 45000101. Quitting there matters: without it, sed would keep reading the remaining gigabytes of the file after it had already printed the range you need.

Kyle Jones
5

Create a MySQL database with a single table which has a single field. Then import your file into the database. This will make it very easy to look up a certain line.

I don't think anything else could be faster (if head and tail already fail). In the end, the application that wants to find line n has to seek through the whole file until it has found n newlines. Without some sort of lookup (a mapping from line index to byte offset into the file), no better performance can be achieved.
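To illustrate, such a lookup is simply a mapping from line number to byte offset. Here is a minimal sketch of that idea in Python, assuming the same /tmp/my_large_file path used below:

from array import array

# Build a line-number -> byte-offset index in one pass over the file.
# array('Q') keeps it to 8 bytes per line instead of full Python int objects.
offsets = array('Q', [0])             # offsets[i] = byte offset where line i+1 starts
with open('/tmp/my_large_file', 'rb') as f:
    for line in f:
        offsets.append(offsets[-1] + len(line))

def get_line(n):
    """Return line n (1-based) by seeking straight to its byte offset."""
    with open('/tmp/my_large_file', 'rb') as f:
        f.seek(offsets[n - 1])
        return f.readline()

print(get_line(45000000))

Building the index still costs one full pass over the 19 GB file, which is essentially the work the database import below does for you, once.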

Given how easy it is to create a MySQL database and import data into it, I feel like this is a viable approach.

Here is how to do it:

-- Set up a throwaway database with one table: an auto-incremented line number plus the line text.
DROP DATABASE IF EXISTS helperDb;
CREATE DATABASE `helperDb`;
CREATE TABLE `helperDb`.`helperTable`( `lineIndex` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT, `lineContent` MEDIUMTEXT , PRIMARY KEY (`lineIndex`) );
-- Import the file; each line becomes one row, numbered in order by AUTO_INCREMENT.
LOAD DATA INFILE '/tmp/my_large_file' INTO TABLE helperDb.helperTable (lineContent);
-- Pull out the wanted range by line number (>= and <= so that both endpoints are included).
SELECT lineContent FROM helperDb.helperTable WHERE ( lineIndex >= 45000000 AND lineIndex <= 45000100 );

/tmp/my_large_file would be the file you want to read.

The correct syntax to import a file with tab-delimited values on each line is:

LOAD DATA INFILE '/tmp/my_large_file' INTO TABLE helperDb.helperTable FIELDS TERMINATED BY '\n' (lineContent);

Another major advantage of this is that if you decide later on to extract another set of lines, you won't have to wait hours for the processing again (unless you delete the database, of course).

Oliver Salzburg
  • 86,445
  • 63
  • 260
  • 306
  • So this is a good solution, indeed. I got it to work with the `sed` command below, and identified my lines. But now I have a follow-up question that the database method may be better suited for: I now need to delete a couple hundred lines from the file. – nicorellius Feb 21 '12 at 18:18
  • I'm sure `sed` could do that as well. Of course, if you had the data in the database it would be trivial to export a new file with just the lines you want. – Oliver Salzburg Feb 21 '12 at 18:22
  • Thanks again. I took the `sed` answer (because it gave me more immediate pleasure ;--) but gave you an up-vote because I will use your method in the future. I appreciate it. – nicorellius Feb 21 '12 at 18:37
  • I attempted to use your SQL code above and it seemed to process, but then when I ran the query to view my lines, it just gave me the first column of the tab-delimited line. Each of the lines is tab-delimited. Is there any advice you could give me to get all the lines into the table, as expected? – nicorellius Feb 21 '12 at 21:05
  • You could try adding a `FIELDS TERMINATED BY '\n'` to the [`LOAD DATA`](http://dev.mysql.com/doc/refman/5.1/en/load-data.html) line. – Oliver Salzburg Feb 21 '12 at 22:35
  • OK, thanks. Not too familiar with this syntax, but am getting an error when using this: `LOAD DATA INFILE '/tmp/my_large_file' INTO TABLE helperTable (lineContent) FIELDS TERMINATED BY '\n';` I've searched around through docs and nothing is popping out. Any thoughts? Sorry to bother you with this. – nicorellius Feb 21 '12 at 23:53
  • I'm sorry, there was a mistake in my code. I also added the correct syntax for your case (tested this time). – Oliver Salzburg Feb 22 '12 at 00:20
  • Awesome - thanks - I will test this later today. Appreciate your help. – nicorellius Feb 23 '12 at 00:47
3

Two good old tools for big files are split and cat: split to cut the file into pieces, and cat to join them back together afterwards. You can use split with the --lines=<number> option, which cuts the file into multiple files of a fixed number of lines each.

For example, split --lines=45000000 huge_file.txt. The resulting parts would be named xaa, xab, and so on. You can then head the part xab, which starts at line 45000001 (line 45000000 itself ends up as the last line of xaa with this split size). You can also concatenate the parts back into a single big file with cat.
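For the range in the question, something like tail -n 1 xaa followed by head -n 100 xab should then cover lines 45000000 through 45000100.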

Anssi
2

You have the right tools but are using them incorrectly. As previously answered over at U&L, tail -n +X file | head -n Y (note the +) is 10-15% faster than sed for Y lines starting at X. And conveniently, you don't have to explicitly exit the process as with sed.

tail will read and discard the first X-1 lines (there's no way around that), then read and print the following lines. head will read and print the requested number of lines, then exit. When head exits, tail receives a SIGPIPE signal and dies, so it won't have read more than a buffer size's worth (typically a few kilobytes) of lines from the input file.
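For the line range in the question, that would be something like tail -n +45000000 bigfile | head -n 101 > savedlines (101 lines, so that both endpoints of the range are included); the file names just mirror the sed answer above.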

Erich