HTML tag issue

#1
onething

Hi, I'm trying to input an HTML tag into a ZennoPoster scraper. Perhaps because I don't know HTML tags well, I'm not getting the tag right. I'm starting from this table:

HTML input tag

I used input:text, and it covered letters but not numbers, which at least sounded logical to me. When I use input:id or input:name, it covers both numbers and letters. But what do I need in order to cover non-alphanumeric keystrokes? I'm looking for a tag that will recognise input such as / or : (for inurl:mysite/mypage queries, for example).

thanks
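
For anyone reading along: in plain HTML, text, name and id are not character classes. type="text" says what kind of field it is, name and id only identify the field, and a text field accepts any character. Here is a minimal Python sketch of that idea, assuming a made-up form snippet (the markup and the InputCollector helper are hypothetical, not taken from Google's page or from ZennoPoster):

```python
from html.parser import HTMLParser

# Hypothetical form markup, for illustration only.
SAMPLE = (
    '<form>'
    '<input type="text" name="q" id="search" value="inurl:mysite/mypage">'
    '</form>'
)


class InputCollector(HTMLParser):
    """Collects the attributes of every <input> tag it sees."""

    def __init__(self):
        super().__init__()
        self.inputs = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            # attrs is a list of (name, value) pairs.
            self.inputs.append(dict(attrs))


collector = InputCollector()
collector.feed(SAMPLE)
field = collector.inputs[0]

# type, name and id are just attributes; the value keeps ':' and '/'.
print(field["type"], field["name"], field["id"], field["value"])
```

If that holds for the page being scraped, the choice between text, name and id should only affect which element is found, not which characters can be typed into it.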

#2
ZekeDragon

Could you post a code sample of your using the input tag?

#3
onething

ZekeDragon said:

Could you post a code sample of your using the input tag?

I would, but it's full of ZennoPoster-specific terms:

<?xml version="1.0" encoding="utf-8"?>

<Project Name="google.xml" ProxyFilter="" Flags="DLCTL_DLIMAGES, DLCTL_VIDEOS, DLCTL_BGSOUNDS, DLCTL_NO_SCRIPTS, DLCTL_NO_JAVA, DLCTL_NO_RUNACTIVEXCTLS, DLCTL_NO_DLACTIVEXCTLS, DLCTL_NO_FRAMES, CMD_ALLOWPOPUP, CMD_DISGUISE" Version="3.0">

  <Step ID="●8●2●3●2●4●3●" Type="Web" x="30" y="30">

    <Branch ID="cca-3867" Type="WebBrowser" PictureIndex="" Action="Set" Name="CMD_CLEARCOOKIE" Comment="clean cookies">

      <Parameters />

      <Results />

    </Branch>

    <Branch ID="nav-7168" Type="WebBrowser" PictureIndex="" Action="Set" Name="CMD_NAVIGATE" Comment="Load google.com">

      <Parameters>

        <Value>google.com-|-page</Value>

      </Parameters>

      <Results />

    </Branch>

    <Branch ID="vc-3837" Type="HTMLElement" PictureIndex="107235567" Action="Set" Name="value" Comment="our word for search">

      <Parameters>

        <Finder>

          <Document />

          <Form />

          <Element>

            <TabPath>page</TabPath>

            <DocPath>0</DocPath>

            <Tag>input:text</Tag>

            <FormNumber>0</FormNumber>

            <SearchCondition AttrName="name" AttrValue="q" SearchKind="text" GroupNumber="0" Number="0" />

            <SearchCondition AttrName="fulltag" AttrValue="input:text" SearchKind="text" GroupNumber="2" Number="0" />

          </Element>

        </Finder>

        <Value>zennoposter</Value>

      </Parameters>

      <Results />

    </Branch>

    <Branch ID="re-8371" Type="HTMLElement" PictureIndex="1152725869" Action="RiseEvent" Name="click" Comment="click on search button">

      <Parameters>

        <Finder>

          <Document />

          <Form />

          <Element>

            <TabPath>page</TabPath>

            <DocPath>0</DocPath>

            <Tag>input:submit</Tag>

            <FormNumber>0</FormNumber>

            <SearchCondition AttrName="name" AttrValue="btnG" SearchKind="text" GroupNumber="0" Number="0" />

            <SearchCondition AttrName="fulltag" AttrValue="input:submit" SearchKind="text" GroupNumber="2" Number="0" />

          </Element>

        </Finder>

        <Emulation>False</Emulation>

      </Parameters>

      <Results />

    </Branch>

    <Branch ID="≡1≡7≡0≡0≡7≡6≡7≡8≡6≡2≡" Type="WebBrowser" PictureIndex="" Action="Get" Name="CMD_DOM_HTML" Comment="parse all url">

      <Parameters>

        <Value>page-|-(?&lt;=\'\)\" href\=\").*?(?=\"\>\&lt;EM\>)-|-all</Value>

      </Parameters>

      <Results />

    </Branch>

    <Branch ID="≡5≡3≡9≡9≡2≡7≡7≡9≡" Type="Macros" PictureIndex="" Action="Get" Name="{-File.AppendString-|-\Results\test.txt-|-{-FieldData.FieldData-|-●8●2●3●2●4●3●-|-≡1≡7≡0≡0≡7≡6≡7≡8≡6≡2≡-}-|-true-}" OutGood="●3●8●1●5●6●3●|re-7410" Comment="save result">

      <Parameters />

      <Results />

    </Branch>

  </Step>

  <Step ID="●3●8●1●5●6●3●" Type="Web" x="402" y="372">

    <Branch ID="re-7410" Type="HTMLElement" PictureIndex="803815211" Action="RiseEvent" Name="click" Comment="click next" OutGood="●8●2●3●2●4●3●|≡1≡7≡0≡0≡7≡6≡7≡8≡6≡2≡">

      <Parameters>

        <Finder>

          <Document />

          <Form />

          <Element>

            <TabPath>page</TabPath>

            <DocPath>0</DocPath>

            <Tag>a</Tag>

            <FormNumber>-1</FormNumber>

            <SearchCondition AttrName="id" AttrValue="pnnext" SearchKind="regexp" GroupNumber="0" Number="0" />

          </Element>

        </Finder>

        <Emulation>true</Emulation>

      </Parameters>

      <Results />

    </Branch>

  </Step>

</Project>

Basically, I would search for inurl:sports/soccer, then for inurl:cnn.com/sports, and get different results: none on the first try and only one on the second, even though both searches returned full pages of links. I concluded it was an HTML tag issue when I managed to get alphanumeric characters scraped, at least partially, after changing the tag from text to name or id; with text it was only scraping alphabetic characters. But no matter what tag I put in, it wouldn't scrape inurl: queries without .com, which was really annoying and led me to think there was an HTML tag covering non-alphanumeric characters that I didn't know about. Then I saw that inurl:cnn.com/sports was scraping relatively OK, which brought me back to thinking it was a regexp issue. Overall I'm just terribly confused by it, so I think I'll move on to other templates instead.
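
One possible reading of these symptoms is that the CMD_DOM_HTML regex, rather than the input tag, is dropping results: its lookahead only accepts a link whose title starts with a highlighted <EM> term. Here is a minimal Python sketch of that behaviour, run against made-up result markup (the HTML, the URLs and the names HREF_BEFORE_EM and ANY_HREF are assumptions for illustration, not real Google output):

```python
import re

# Same pattern as in the project file, written as a Python raw string.
HREF_BEFORE_EM = re.compile(r"""(?<='\)" href=").*?(?="><EM>)""")

# A looser, hypothetical alternative: capture every href value.
ANY_HREF = re.compile(r'href="([^"]*)"')

# Hypothetical result markup: only the first title starts with <EM>.
SAMPLE = (
    '<A onmousedown="return clk(\'x\')" href="http://cnn.com/sports">'
    '<EM>cnn.com</EM> Sports</A>'
    '<A onmousedown="return clk(\'y\')" href="http://example.com/sports/soccer">'
    'Soccer news</A>'
)

matches = HREF_BEFORE_EM.findall(SAMPLE)
all_links = ANY_HREF.findall(SAMPLE)

print(matches)    # links whose title begins with <EM>
print(all_links)  # every link in the sample
```

If the real pages behave like this sample, a query whose highlighted term rarely opens the link title would return few or no matches, whichever input tag is used.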



