6

Whenever I copy text from a PDF file that is formatted with line breaks (or carriage returns), I need a way to remove those line breaks without losing the paragraph structure.

To do this I need to use RegEx (Regular expressions) to only remove the line breaks which aren't preceded by a period.

So for example, if a string of text has a line break right after a period, that is obviously almost always a legitimate line break which will start a new paragraph. If a string of text has a line break mid-word or after a word with no period, it's simply part of the bad formatting I need to get rid of.

My problem is that I don't know how to use RegEx to remove only the ^p marks in Word (or CRLFs, or line breaks in any other format) while skipping the ones that follow a period.

Luke Allen
  • Please mention your operating system. On anything but Windows, this is trivial. I take it you are using Windows? What RegEx engine are you using? We need to know more details in order to provide you with a working RegEx. – terdon Sep 02 '12 at 11:57
  • Do you simply want to remove the line breaks?  I suspect you really want to replace them with spaces.  And what about line breaks after `?` or `!`?  Or `.)`, `?)`, or `!)`? – Scott - Слава Україні Aug 01 '13 at 00:00

4 Answers

3

Solution for MS Word:

  1. Open Find & Replace (Ctrl+H) and check the "Use wildcards" option. If you don't see the "Use wildcards" option, click "More".
  2. Copy the following into the "Find what" box: ([!.])^0013
  3. Copy the following into the "Replace with" box: \1
  4. Click "Replace All"

Explanation:

  • [!.] means "match any single character except a dot"
  • ^0013 is a paragraph mark, so the "Find what" pattern matches every non-dot character that is immediately followed by a paragraph mark
  • The parentheses store that non-dot character so it can be reused in the replacement
  • \1 in the replacement writes the stored character back in place of the whole match

Note that the ^0013 is not inside the parentheses, so the matched paragraph marks are dropped from the final text, while paragraph marks that follow a dot are left alone.
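For text handled outside Word, the same idea can be sketched with Python's `re` module. This is only an analogue of the wildcard search above, not Word's own engine: the function name `join_broken_lines` is made up for illustration, and the pattern assumes line breaks are LF or CRLF rather than Word paragraph marks. Unlike the Word pattern, it also excludes line-break characters from the captured class so that blank lines and the `\r` of a CRLF are never swallowed.

```python
import re


def join_broken_lines(text: str) -> str:
    """Remove line breaks that do not directly follow a period.

    ([^.\r\n]) captures one character that is neither a period nor part
    of a line break; \r?\n then matches the LF or CRLF after it. The
    replacement writes back only the captured character, so the line
    break itself is dropped.
    """
    return re.sub(r"([^.\r\n])\r?\n", r"\1", text)


sample = "This sentence was bro\nken mid-word.\nThis starts a new paragraph."
print(join_broken_lines(sample))
```

Line breaks after a period are untouched, so the paragraph boundaries survive while the mid-sentence breaks are removed.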

Indrek
mar4enk0
0

A much easier way to create or modify an address block before cutting and pasting it into an email or other document is to create a table of 3 or 4 rows and type the address data into each row. Then get rid of the table's lines.

bummi
Keawe
0

In Word try to find and replace the manual line break ^l with the paragraph mark ^p.

Indrek
hsawires
  • It's from a pdf all line breaks are ^p – Luke Allen Sep 02 '12 at 06:54
  • OK. Try replacing ^p with a space. This will fix the paragraph marks, but the one problem you will face is that all the paragraphs will be merged into a single paragraph. – hsawires Sep 02 '12 at 07:15
  • Yeah, that is what the question I posted is trying to solve. I already knew to replace ^p with a space; I need to replace only the ^p that don't have a period before them, so that the paragraphs are maintained but the formatting breaks are not. – Luke Allen Sep 02 '12 at 07:23
  • I tried saving the PDF as a Word document in Acrobat and it works fine, except that you may need to do extra work to clean unwanted text out of the doc file. Some other software may help you convert PDF to DOC. – hsawires Sep 02 '12 at 07:32
0

Because sentences can end in more punctuation than a period, I’ve updated mar4enk0’s answer to:

  1. Find every symbol except dot, question mark, exclamation point, close quote or colon.
  2. Additionally, in some cases you’ll want to add a space after \1 in the “Replace with” box to keep from combining the last word on one line with the first word on the next line.

Solution for MS Word:

  1. Open Find & Replace (Ctrl+H) and check the “Use wildcards” option.
  2. If you don’t see the “Use wildcards” option, click “More.”
  3. Copy the following into the “Find what” box: ([!.\?\!"':])^0013
  4. Copy the following into the “Replace with” box: \1
  5. Click “Replace All.”

Explanation:

  • [!.\?\!"':] means “match any single character except dot, question mark, exclamation point, close quote or colon.”
  • ^0013 is a paragraph mark, so the “Find what” pattern matches every character outside that set that is immediately followed by a paragraph mark.
  • The parentheses store that character so it can be reused in the replacement.
  • \1 in the replacement writes the stored character back in place of the whole match.

Note that the ^0013 is not inside the parentheses, so the final text would be without paragraph marks.
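As a rough Python counterpart to this extended pattern, the sketch below treats the same punctuation set as a legitimate end of paragraph and replaces every other line break with a space. The name `unwrap_lines` is hypothetical, the text is assumed to use LF or CRLF line breaks, and only straight quotes are in the set; curly quotes would have to be added if the source text uses them. Inserting a space suits breaks between words; for breaks in the middle of a word, the replacement would need to be `\1` without the space.

```python
import re

# Characters after which a line break is kept as a paragraph break:
# dot, question mark, exclamation point, straight quotes, colon.
# \r and \n are excluded as well so blank lines are left untouched.
BROKEN_LINE = re.compile(r"""([^.?!"':\r\n])\r?\n""")


def unwrap_lines(text: str) -> str:
    """Replace line breaks that do not follow sentence-ending punctuation
    with a single space, keeping the character before the break."""
    return BROKEN_LINE.sub(r"\1 ", text)


sample = "Was the report split across\nseveral lines?\nYes, it was!\nDone."
print(unwrap_lines(sample))
```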