I need to write a Perl script to read in a file, and delete anything inside < >, even if they're on different lines. That is, if the input is:
Hello, world. I <enjoy eating
bagels. They are quite tasty.
I prefer when I ate a bagel to
when I >ate a sandwich. <I also
like >bananas.
I want the output to be:
Hello, world. I ate a sandwich. bananas.
I know how to do this with a regex if the text is on one line, but I don't know how to do it across multiple lines. Ultimately I need to be able to conditionally delete parts of a template so I can generate parameterized config files. I thought Perl would be a good language, but I am still getting the hang of it.
Edit: It also needs to handle more than one instance of <>.
-
You may want to check out the Perl module Text::Balanced, part of the core distribution. I think it will help you. Generally, one wants to avoid regexes for this sort of thing if the subject text is likely to contain an inner set of delimiters; it can get very messy.
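As a rough sketch of what the Text::Balanced route could look like, the snippet below uses extract_bracketed to drop balanced, possibly nested <...> spans. The helper name strip_balanced and the choice to keep an unbalanced < as literal text are assumptions made for illustration, not part of the module.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical helper: remove balanced (possibly nested) <...> spans.
sub strip_balanced {
    my ($text) = @_;
    my $out = '';
    while (length $text) {
        my $pos = index($text, '<');
        if ($pos < 0) {               # no more brackets: keep the rest
            $out .= $text;
            last;
        }
        $out .= substr($text, 0, $pos);   # keep text before the '<'
        $text = substr($text, $pos);      # $text now starts at '<'
        my ($extracted, $remainder) = extract_bracketed($text, '<>');
        if (defined $extracted && length $extracted) {
            $text = $remainder;           # drop the balanced span
        }
        else {
            # Assumption: an unbalanced '<' is kept as literal text.
            $out .= substr($text, 0, 1);
            $text = substr($text, 1);
        }
    }
    return $out;
}

print strip_balanced("a <b <c> d> e\n");
```

For input without nested brackets this should behave like the plain regex answers; the difference only shows up when a span contains another <...> pair.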
rlbond : Good advice, but not needed in this case. Will definitely keep in mind though. -
In Perl:
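    #!/usr/bin/perl
    use strict;

    local $/;    # slurp the whole input
    my $text = <>;
    $text =~ s/<[^>]*>//g;
    print $text;

The regex matches everything from a < through the first > (inclusive) and replaces it with nothing. Because [^>] also matches newlines, it works across lines once the whole input is in one string. The g flag makes it global (it replaces every match, not just the first).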
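To check that substitution against the sample from the question, here is a small self-contained sketch. The function name strip_angle is made up for the example; the input is held in a string rather than read from a file.

```perl
use strict;
use warnings;

# Remove every <...> span. [^>] also matches newlines, so no /s is needed.
sub strip_angle {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;
    return $text;
}

# The sample input from the question, as one multi-line string.
my $sample = join '',
    "Hello, world. I <enjoy eating\n",
    "bagels. They are quite tasty.\n",
    "I prefer when I ate a bagel to\n",
    "when I >ate a sandwich. <I also\n",
    "like >bananas.\n";

print strip_angle($sample);   # prints: Hello, world. I ate a sandwich. bananas.
```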
EDIT: incorporated comments from Hynek and chaos
Andrew Hare : +1 Nice (complete) example!
Hynek -Pichi- Vychodil : It's a little bit inefficient to split it and join it again. perl -0777 -pe 's/<[^>]*>//gm'
chaos : The /m modifier isn't helping. It means 'treat as multiline', i.e. match ^ and $ at newlines, not 'this is multiline'. /s, treat as single line, is actually more what you'd want, but you don't need it because your pattern doesn't use the dot.
Alan Moore : I would put both angle brackets in the negated character class: s/<[^<>]*>//g. Otherwise, you could match from an earlier unclosed < through a later >, which probably isn't what you want.
rlbond : Very useful. Chaos's answer, however, is more adaptable to multi-character delimiters, i.e. using . and /s rather than [^(delimiter)]. +1 for great advice though. -
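The points raised in these comments can be compared side by side. This is only an illustrative sketch: the test strings and the [[ ... ]] delimiters are invented for the example.

```perl
use strict;
use warnings;

my $multi = "keep <drop\nthis> keep";

(my $no_s   = $multi) =~ s/<.*?>//g;     # . stops at "\n": nothing is removed
(my $with_s = $multi) =~ s/<.*?>//gs;    # /s lets . match "\n"
(my $class  = $multi) =~ s/<[^>]*>//g;   # a negated class matches "\n" anyway

my $stray = "a < b <c> d";
(my $loose  = $stray) =~ s/<[^>]*>//g;   # match starts at the stray "<"
(my $strict = $stray) =~ s/<[^<>]*>//g;  # only "<c>" is removed

# Multi-character delimiters (hypothetical [[ ... ]]) need .*? with /s.
my $tmpl = "x [[ gone\nalso gone ]] y";
(my $multi_char = $tmpl) =~ s/\[\[.*?\]\]//gs;

print "$_\n" for $no_s, $with_s, $class, $loose, $strict, $multi_char;
```

The negated character class is the simpler choice for single-character delimiters, while .*? with /s extends more naturally to longer delimiters.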
Inefficient one-liner (it slurps the whole file into memory):
perl -0777 -pe 's/<.*?>//gs'
The same as a program:

    local $/;
    my $text = <>;
    $text =~ s/<.*?>//gs;
    print $text;

It depends on how big the text you want to convert is. Here is a more efficient one-liner that consumes the input line by line:
perl -pe 'if ($a) {(s/.*?>// and do {s/<.*?>//g; $a = s/<.*//s;1}) or $_=q{}} else {s/<.*?>//g; $a = s/<.*//s}'
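The same as a program:

    my $in_tag;    # true while inside a <...> that is still open
    while (<>) {
        if ($in_tag) {
            if (s/.*?>//) {          # the open span closes on this line
                s/<.*?>//g;
                $in_tag = s/<.*//s;
            }
            else { $_ = q{} }        # the whole line is inside the span
        }
        else {
            s/<.*?>//g;
            $in_tag = s/<.*//s;      # a span opens and runs past the line end
        }
        print;
    }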
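The line-by-line idea can also be wrapped in a function that works on any pair of filehandles, which makes it easy to try on an in-memory string. This is a sketch under assumptions: the name strip_stream and the shortened sample input are invented here.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out, dropping <...> spans line by line.
sub strip_stream {
    my ($in, $out) = @_;
    my $in_tag = 0;    # true while a span opened on an earlier line is unclosed
    while (my $line = <$in>) {
        if ($in_tag) {
            # Drop everything up to the first '>'; skip the line if there is none.
            next unless $line =~ s/.*?>//;
        }
        $line =~ s/<.*?>//g;                      # spans closed on this line
        $in_tag = ($line =~ s/<.*//s) ? 1 : 0;    # span left open at line end
        print {$out} $line;
    }
    return;
}

my $input = "Hello, world. I <enjoy eating\nbagels.\n"
          . "when I >ate a sandwich. <I also\nlike >bananas.\n";
my $result = '';
open my $in,  '<', \$input  or die "open: $!";
open my $out, '>', \$result or die "open: $!";
strip_stream($in, $out);
close $out;
print $result;   # prints: Hello, world. I ate a sandwich. bananas.
```

Only one line is held in memory at a time, which is the point of this variant for large inputs.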
chaos : As noted re CoverosGene's answer, /m isn't necessary or helpful.
Hynek -Pichi- Vychodil : Yes, you are right. -
    local $/;
    my $text = <>;
    $text =~ s/<.*?>//gs;
    print $text;
daotoad : If your string contains nested brackets, such as <abc<def>ghi>, your regex leaves 'ghi>'. If nested or escaped brackets and other perverse cases "never happen", the regex is fine. To handle the perverse cases, use Text::Balanced, even though the interface is weird. -
You might find "How can I remove text within parentheses with a regex?" helpful.