Breaking Up Strings

11 replies to this topic

#1
chili5

    Writes binary right handed and hex left handed

  • Members
  • 7,247 posts
I have an input file that looks like this:

Quote

<a>sample link</a>
<a rel="" href="http://dwite.ca/">link with rel</a>
<a href="http://compsci.ca/" rel="nofollow">link with no follow</a>
<a href="http://compsci.ca/blog" rel="external">more rels</a>
<a href="http://compsci.ca/v3/viewforum.php?f=131" title="">link</a>

From each line I'm trying to extract the rel="..." attribute, including whatever value appears between its quotes.

My current code is:


/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */

package dwite;

import java.io.*;
import java.util.*;
import java.util.regex.*;

/**
 *
 * @author brocj1112
 */
public class links {
    public static void main(String[] args) throws IOException {
        //Code goes here
        Scanner fin = new Scanner(new FileReader("IN/links.in"));
        String sHTML;
        String sLink;
        String sRel = "";

        while (fin.hasNextLine()) {
            sHTML = fin.nextLine();

            // get just the link part
            sLink = sHTML.substring(sHTML.indexOf("<a"),sHTML.indexOf("</a>")+4);

            System.out.println(sLink);
            if (!sLink.contains("rel=")) {
                // rel is not found
                // so add it to the end of the link
                sLink = sLink.substring(0,sLink.indexOf(">")) + " rel=\"nofollow\">"
                        + sLink.substring(sLink.indexOf(">")+1);
            } else if (sLink.contains("rel=\"\""))  {
                sLink = sLink.replace(" rel=\"\"", "");
                sLink = sLink.substring(0,sLink.indexOf(">")) + " rel=\"nofollow\">"
                        + sLink.substring(sLink.indexOf(">")+1);
            } else if (!sLink.contains("nofollow")) {
                // rel already exists so add nofollow to the end of the rel
                sRel = sLink.substring(sLink.indexOf("rel=\""),sLink.indexOf("rel=\"", sLink.indexOf("rel=\"")));
            }
        }

        fin.close();
    }
}


The code that is trying to get the rel="" portion is:


sLink.substring(sLink.indexOf("rel=\""),sLink.indexOf("rel=\"", sLink.indexOf("rel=\"")));


However, this code only returns:

rel=

and nothing else. Any ideas as to how to accomplish this?
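
A note on why that expression comes up short: the second indexOf starts searching at the position where rel=" already begins, so it finds that same occurrence again and the substring has nothing useful between its start and end. One possible fix, as a minimal sketch, is to search for the closing quote starting just past the opening one. The class and method names below are made up for illustration, and the sketch assumes the attribute is always written with double quotes, as in the sample input.

```java
class RelIndexOf {
    /** Returns the whole rel="..." attribute, or null if the link has none. */
    static String extractRel(String link) {
        String marker = "rel=\"";
        int start = link.indexOf(marker);
        if (start < 0) {
            return null; // no rel attribute at all
        }
        // Look for the closing quote AFTER the opening one, not from 'start' itself.
        int close = link.indexOf('"', start + marker.length());
        if (close < 0) {
            return null; // unterminated attribute
        }
        return link.substring(start, close + 1);
    }

    public static void main(String[] args) {
        String link = "<a href=\"http://compsci.ca/blog\" rel=\"external\">more rels</a>";
        System.out.println(extractRel(link)); // prints rel="external"
    }
}
```

This plain search would also match rel=" inside a longer attribute name or inside a URL, so it is only safe for input as simple as the samples.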

#2
John

    Writes binary right handed and hex left handed

  • Moderators
  • 6,321 posts
There are a few ideas that come to mind, but your best bet is probably using regular expressions.

#3
chili5

    Writes binary right handed and hex left handed

  • Members
  • 7,247 posts
Is using regular expressions in Java the same as in PHP?

#4
John

    Writes binary right handed and hex left handed

  • Moderators
  • 6,321 posts
There is a 95% chance the answer is yes.

#5
WingedPanther

    A spammer's worst nightmare

  • Moderators
  • 16,831 posts
Pretty much. Java includes a regex library that is very robust.
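
For anyone coming from PHP, the pattern syntax is largely the same. The practical differences are that Java has no /.../ delimiters around the pattern, and that the pattern is an ordinary string literal, so every regex backslash has to be written twice. Here is a minimal sketch of pulling the rel value out with java.util.regex. The class and method names are invented for the example, and it assumes double-quoted attributes like those in the sample input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RelRegexDemo {
    // \s requires whitespace before "rel", so names such as data-rel are not matched.
    // Group 1 captures everything up to the closing quote (possibly nothing).
    static final Pattern REL = Pattern.compile("\\srel=\"([^\"]*)\"");

    /** Returns the value inside rel="...", or null if there is no rel attribute. */
    static String relValue(String link) {
        Matcher m = REL.matcher(link);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(relValue("<a href=\"http://compsci.ca/blog\" rel=\"external\">more rels</a>"));
        // prints external
    }
}
```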

#6
chili5

    Writes binary right handed and hex left handed

  • Members
  • 7,247 posts
Thanks guys! :)

#7
Skel

    Learning Programmer

  • Members
  • 33 posts
if(sLink.contains("rel=")) {
    String[] temp = sLink.split(" ");
    String rel = temp[1];
}

Personally I'd do it like that.

#8
chili5

    Writes binary right handed and hex left handed

  • Members
  • 7,247 posts
No, that won't work, since rel can appear anywhere in the string.

#9
Turk4n

    Writes binary right handed and hex left handed

  • Members
  • 3,847 posts

Skel said:

if(sLink.contains("rel=")) {
    String[] temp = sLink.split(" ");
    String rel = temp[1];
}

Personally I'd do it like that.

Personally, I think you did it wrong on purpose!

#10
chili5

    Writes binary right handed and hex left handed

  • Members
  • 7,247 posts
Skel, your code can't work in general, because temp[1] is just whatever follows the first space. If my link was:

<a href="#" rel="external">test</a>

Your code would return href="#", which is not the rel attribute at all!

I'm looking into how to do this with regular expressions, and I expect my pattern will look something like this:

String sPattern = "rel=\"[\\s\\w]{1,}+\"";

I haven't tried it yet, but I think it will look roughly like that.
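
One detail to watch when trying a pattern of this shape: inside a Java string literal the regex escapes have to be written as \\s and \\w, because the compiler rejects a bare \w as an illegal escape. The quick check below runs such a pattern over the five sample lines from the first post. It is only a sketch. The class and method names are invented, and {1,}+ has been simplified to +, which behaves the same way for this check.

```java
import java.util.regex.Pattern;

class PatternCheck {
    // Regex seen by the engine: rel="[\s\w]+"  (at least one space or word character)
    static final Pattern REL = Pattern.compile("rel=\"[\\s\\w]+\"");

    /** True if the link contains a rel attribute with a non-empty value. */
    static boolean hasNonEmptyRel(String link) {
        return REL.matcher(link).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "<a>sample link</a>",                                                      // false: no rel
            "<a rel=\"\" href=\"http://dwite.ca/\">link with rel</a>",                 // false: empty rel
            "<a href=\"http://compsci.ca/\" rel=\"nofollow\">link with no follow</a>", // true
            "<a href=\"http://compsci.ca/blog\" rel=\"external\">more rels</a>",       // true
            "<a href=\"http://compsci.ca/v3/viewforum.php?f=131\" title=\"\">link</a>" // false: no rel
        };
        for (String s : samples) {
            System.out.println(hasNonEmptyRel(s) + "  " + s);
        }
    }
}
```

The empty rel="" line does not match because the quantifier needs at least one character, which is fine here since that case is handled by its own branch.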

#11
chili5

    Writes binary right handed and hex left handed

  • Members
  • 7,247 posts
Thanks everyone, I got it. :D

String sRegex = "rel=\"[a-zA-Z\\.]{1,}+\"";

// ...

else if (!sLink.contains("nofollow")) {
    // rel already exists so add nofollow to the end of the rel
    mMatch = p.matcher(sLink);

    while (mMatch.find()) {
        sRel = sLink.substring(mMatch.start(), mMatch.end());
        sRel = sRel.substring(0, sRel.length() - 1) + " nofollow\"";
    }
    sLink = sLink.replaceAll(sRegex, sRel);
}

:)
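
For later readers, here is one way the three cases (no rel, empty rel, rel without nofollow) could be folded into a single helper with a capture group. This is a sketch under assumptions, not the code from the post above. The class and method names are invented, the input is assumed to look like the sample lines, and an empty rel is filled in place rather than moved to the end of the tag, which may or may not be what the expected output wants. Two things it sidesteps: a character class like [a-zA-Z\\.] accepts only letters and dots, so a value containing a space or hyphen would not match, and replaceAll treats $ and backslashes in the replacement string specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NofollowSketch {
    // Group 1 is whatever sits between the quotes of rel="...", possibly nothing.
    private static final Pattern REL = Pattern.compile("\\srel=\"([^\"]*)\"");

    /** Returns the link with nofollow present in its rel attribute. */
    static String addNofollow(String link) {
        Matcher m = REL.matcher(link);
        if (!m.find()) {
            // No rel attribute: add one just before the first '>'.
            // Assumes the input starts with a well-formed opening tag, as in the samples.
            int gt = link.indexOf('>');
            return link.substring(0, gt) + " rel=\"nofollow\"" + link.substring(gt);
        }
        String value = m.group(1).trim();
        // Padding with spaces makes this a whole-word check on space-separated values.
        if ((" " + value + " ").contains(" nofollow ")) {
            return link;
        }
        String newValue = value.isEmpty() ? "nofollow" : value + " nofollow";
        // Splice the new value in between the original quotes.
        return link.substring(0, m.start(1)) + newValue + link.substring(m.end(1));
    }

    public static void main(String[] args) {
        System.out.println(addNofollow("<a href=\"http://compsci.ca/blog\" rel=\"external\">more rels</a>"));
        // prints <a href="http://compsci.ca/blog" rel="external nofollow">more rels</a>
    }
}
```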

#12
Turk4n

    Writes binary right handed and hex left handed

  • Members
  • 3,847 posts

Great to hear :)