Thursday, April 28, 2011

How can I delete characters between < and > in Perl?

I need to write a Perl script that reads in a file and deletes anything inside < >, even if the two brackets are on different lines. That is, if the input is:

Hello, world. I <enjoy eating
bagels. They are quite tasty.
I prefer when I ate a bagel to
when I >ate a sandwich. <I also
like >bananas.

I want the output to be:

Hello, world. I ate a sandwich. bananas.

I know how to do this with a regex if the text is on one line, but not when it spans multiple lines. Ultimately I need to conditionally delete parts of a template so that I can generate parametrized config files. I thought Perl would be a good language for this, but I am still getting the hang of it.

Edit: It also needs to handle more than one instance of <>.

From Stack Overflow
  • You may want to check out the Perl module Text::Balanced, part of the core distribution. I think it will be of help to you. Generally, one wants to avoid regexes for this sort of thing if the subject text is likely to contain an inner set of delimiters, because it can get very messy.

    rlbond : Good advice, but not needed in this case. Will definitely keep in mind though.
  • In Perl:

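    #!/usr/bin/perl
    use strict;
    use warnings;

    # Slurp the whole input; with the default $/, <> would return only one line.
    my $text = do { local $/; <> };
    $text =~ s/<[^>]*>//g;   # [^>] also matches newlines
    print $text;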
    

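    The regex matches everything from a < through the first > that follows (inclusive) and replaces it with nothing. Because [^>] also matches newlines, a match can span several lines, provided the whole input has been read into $text. The /g modifier makes the substitution global, so every occurrence is replaced, not just the first.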

    EDIT: incorporated comments from Hynek and chaos

    Andrew Hare : +1 Nice (complete) example!
    Hynek -Pichi- Vychodil : It's a little bit inefficient to split it and join it again. Try: perl -0777 -pe 's/<[^>]*>//gm'
    chaos : the /m modifier isn't helping. It means 'treat as multiline', i.e. match ^ and $ at newlines, not 'this is multiline'. /s, treat as single line, is actually more what you'd want, but you don't need it because your pattern isn't concerned with whitespace.
    Alan Moore : I would put both angle brackets in the negated character class: s/<[^<>]*>//g. Otherwise, you could match from , which probably isn't what you want.
    rlbond : Very useful. chaos's answer, however, is more adaptable to multi-character delimiters, i.e. using . and /s rather than [^(delimiter)]. +1 for great advice though.
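
The points raised in the comments above can be pulled together in a small sketch. It assumes the whole input fits in memory, and the sub names strip_with_class and strip_with_dot are made up for illustration; they do not come from any of the answers.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: negated character class with both brackets excluded,
# as suggested in the comments. [^<>] matches newlines, so no /s is needed.
sub strip_with_class {
    my ($text) = @_;
    $text =~ s/<[^<>]*>//g;
    return $text;
}

# Hypothetical helper: non-greedy dot. Here /s is required, because
# without it "." does not match a newline.
sub strip_with_dot {
    my ($text) = @_;
    $text =~ s/<.*?>//gs;
    return $text;
}

# A shortened version of the sample from the question, held in a string.
my $sample = "I <enjoy\nbagels >ate a sandwich. <I also\nlike >bananas.\n";
print strip_with_class($sample);
print strip_with_dot($sample);
```

On input without nested brackets the two subs should give the same result. The dot form is the easier one to adapt if the delimiters ever become longer than one character.
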
  • The inefficient one-liner way:

    perl -0777 -pe 's/<.*?>//gs'
    

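    The same thing as a program:

    local $/;               # undefine the record separator to slurp all input
    my $text = <>;
    $text =~ s/<.*?>//gs;   # bind to $text; /s lets . match newlines
    print $text;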
    

    Which one is better depends on how big the text you want to convert is. Here is a more efficient one-liner that consumes the input line by line:

    perl -pe 'if ($a) {(s/.*?>// and do {s/<.*?>//g; $a = s/<.*//s;1}) or $_=q{}} else {s/<.*?>//g; $a = s/<.*//s}'
    

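    The same thing as a program:

    my $a;    # true while we are inside an unclosed <
    while (<>) {
        if ($a) {
            # Still inside a tag: drop everything up to the first >.
            if (s/.*?>//) {
                s/<.*?>//g;         # remove complete tags on the rest of the line
                $a = s/<.*//s;      # a tag left open runs to the end of the line
            }
            else { $_ = q{} }       # no > here, so the whole line is inside the tag
        }
        else {
            s/<.*?>//g;
            $a = s/<.*//s;
        }
        print;
    }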
    
    chaos : As noted re CoverosGene's answer, /m isn't necessary or helpful.
    Hynek -Pichi- Vychodil : Yes, you are right.
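
One way to gain confidence in the line-by-line logic is to run it and the slurping version over the sample from the question and compare the results. The following is a sketch only: the sub names strip_slurped and strip_streaming and the variable $in_tag are illustrative, and the streaming sub takes a list of lines instead of reading STDIN purely so that the example is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Slurp style: operate on the whole text at once.
sub strip_slurped {
    my ($text) = @_;
    $text =~ s/<.*?>//gs;
    return $text;
}

# Streaming style: the same steps as the while (<>) loop, one line at a time.
sub strip_streaming {
    my @lines  = @_;    # copy, so the caller's data is not modified
    my $in_tag = 0;     # true while an unclosed "<" is pending
    my $out    = '';
    for my $line (@lines) {
        if ($in_tag) {
            # Drop everything up to the first ">"; skip the line if there is none.
            next unless $line =~ s/.*?>//;
        }
        $line =~ s/<.*?>//g;                       # complete tags on this line
        $in_tag = ($line =~ s/<.*//s) ? 1 : 0;     # tag left open at end of line
        $out .= $line;
    }
    return $out;
}

my @sample = (
    "Hello, world. I <enjoy eating\n",
    "bagels. They are quite tasty.\n",
    "I prefer when I ate a bagel to\n",
    "when I >ate a sandwich. <I also\n",
    "like >bananas.\n",
);

print strip_streaming(@sample);
print strip_slurped(join '', @sample);
```

Both calls should print the single line the question asks for. The streaming form keeps only one line in memory at a time, which is the reason to prefer it for large inputs.
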
  • local $/;
    my $text = <>;
    $text =~ s/<.*?>//gs;
    print $text;
    
    daotoad : If your string looks like this: ghi>, your regex leaves 'ghi>'. If nested or escaped brackets and other perverse cases "never happen" the regex is fine. To handle the perverse cases, use Text::Balanced, even though the interface is weird.
  • You might find "How can I remove text within parentheses with a regex?" helpful.
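
For the nested case raised in the comments, a minimal sketch that stays with plain regexes is shown below. It does not use Text::Balanced. Instead it repeatedly deletes innermost pairs until nothing changes. The sub name strip_nested is hypothetical, and the approach assumes the brackets are balanced and never escaped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove nested <...> groups from the inside out.
sub strip_nested {
    my ($text) = @_;
    # Each pass deletes only pairs with no bracket inside them;
    # s///g returns false once no such pair is left, which ends the loop.
    1 while $text =~ s/<[^<>]*>//g;
    return $text;
}

print strip_nested("abc <def <ghi> jkl> mno\n");
```

A stray < or > with no partner is left in place, so unbalanced or escaped input would still need a real parser or Text::Balanced.
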
