Hi, I'm trying to enter an HTML tag into a ZennoPoster scraper. Perhaps because I don't know HTML tags well, I'm not getting the tag right. I'm working from this table:
HTML input tag
I used input:text, and when I did, it covered letters but not numbers. That sounded logical, to me at least. When I use input:id or input:name, it covers both numbers and letters. But what do I need in order to cover non-alphanumeric characters? I'm looking for a tag that will recognise that I'm entering things like / or : (for inurl:mysite/mypage queries, for example).
thanks
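A note on the terms, as far as their standard HTML meaning goes: text is a value of the input element's type attribute (the kind of control), while id and name are attributes that identify an element. None of them restricts which characters can be typed. How ZennoPoster itself interprets input:text is an assumption here. A text field takes / and : like any other character, and they are percent-encoded when the form is submitted as a URL query. A minimal Python sketch of that encoding (the helper name encode_query is made up for illustration, and quote_plus only approximates what a browser does):

```python
from urllib.parse import quote_plus


def encode_query(query):
    """Percent-encode a search query roughly the way a GET form submission would."""
    return quote_plus(query)


# The ':' and '/' survive as data; they are just written as %3A and %2F in the URL.
print(encode_query("inurl:mysite/mypage"))
```

So the special characters should not need a different tag; they travel as part of the field's value.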
2 replies to this topic
#1
Posted 18 April 2011 - 06:12 AM
#2
Posted 18 April 2011 - 05:58 PM
#3
Posted 20 April 2011 - 05:49 AM
ZekeDragon said:
Could you post a code sample of your using the input tag?
I would, but it's full of terms that are specific to ZennoPoster:
<?xml version="1.0" encoding="utf-8"?>
<Project Name="google.xml" ProxyFilter="" Flags="DLCTL_DLIMAGES, DLCTL_VIDEOS, DLCTL_BGSOUNDS, DLCTL_NO_SCRIPTS, DLCTL_NO_JAVA, DLCTL_NO_RUNACTIVEXCTLS, DLCTL_NO_DLACTIVEXCTLS, DLCTL_NO_FRAMES, CMD_ALLOWPOPUP, CMD_DISGUISE" Version="3.0">
  <Step ID="●8●2●3●2●4●3●" Type="Web" x="30" y="30">
    <Branch ID="cca-3867" Type="WebBrowser" PictureIndex="" Action="Set" Name="CMD_CLEARCOOKIE" Comment="clean cookies">
      <Parameters />
      <Results />
    </Branch>
    <Branch ID="nav-7168" Type="WebBrowser" PictureIndex="" Action="Set" Name="CMD_NAVIGATE" Comment="Load google.com">
      <Parameters>
        <Value>google.com-|-page</Value>
      </Parameters>
      <Results />
    </Branch>
    <Branch ID="vc-3837" Type="HTMLElement" PictureIndex="107235567" Action="Set" Name="value" Comment="our word for search">
      <Parameters>
        <Finder>
          <Document />
          <Form />
          <Element>
            <TabPath>page</TabPath>
            <DocPath>0</DocPath>
            <Tag>input:text</Tag>
            <FormNumber>0</FormNumber>
            <SearchCondition AttrName="name" AttrValue="q" SearchKind="text" GroupNumber="0" Number="0" />
            <SearchCondition AttrName="fulltag" AttrValue="input:text" SearchKind="text" GroupNumber="2" Number="0" />
          </Element>
        </Finder>
        <Value>zennoposter</Value>
      </Parameters>
      <Results />
    </Branch>
    <Branch ID="re-8371" Type="HTMLElement" PictureIndex="1152725869" Action="RiseEvent" Name="click" Comment="click on search button">
      <Parameters>
        <Finder>
          <Document />
          <Form />
          <Element>
            <TabPath>page</TabPath>
            <DocPath>0</DocPath>
            <Tag>input:submit</Tag>
            <FormNumber>0</FormNumber>
            <SearchCondition AttrName="name" AttrValue="btnG" SearchKind="text" GroupNumber="0" Number="0" />
            <SearchCondition AttrName="fulltag" AttrValue="input:submit" SearchKind="text" GroupNumber="2" Number="0" />
          </Element>
        </Finder>
        <Emulation>False</Emulation>
      </Parameters>
      <Results />
    </Branch>
    <Branch ID="≡1≡7≡0≡0≡7≡6≡7≡8≡6≡2≡" Type="WebBrowser" PictureIndex="" Action="Get" Name="CMD_DOM_HTML" Comment="parse all url">
      <Parameters>
        <Value>page-|-(?&lt;=\'\)\" href\=\").*?(?=\"\&gt;\&lt;EM\&gt;)-|-all</Value>
      </Parameters>
      <Results />
    </Branch>
    <Branch ID="≡5≡3≡9≡9≡2≡7≡7≡9≡" Type="Macros" PictureIndex="" Action="Get" Name="{-File.AppendString-|-\Results\test.txt-|-{-FieldData.FieldData-|-●8●2●3●2●4●3●-|-≡1≡7≡0≡0≡7≡6≡7≡8≡6≡2≡-}-|-true-}" OutGood="●3●8●1●5●6●3●|re-7410" Comment="save result">
      <Parameters />
      <Results />
    </Branch>
  </Step>
  <Step ID="●3●8●1●5●6●3●" Type="Web" x="402" y="372">
    <Branch ID="re-7410" Type="HTMLElement" PictureIndex="803815211" Action="RiseEvent" Name="click" Comment="click next" OutGood="●8●2●3●2●4●3●|≡1≡7≡0≡0≡7≡6≡7≡8≡6≡2≡">
      <Parameters>
        <Finder>
          <Document />
          <Form />
          <Element>
            <TabPath>page</TabPath>
            <DocPath>0</DocPath>
            <Tag>a</Tag>
            <FormNumber>-1</FormNumber>
            <SearchCondition AttrName="id" AttrValue="pnnext" SearchKind="regexp" GroupNumber="0" Number="0" />
          </Element>
        </Finder>
        <Emulation>true</Emulation>
      </Parameters>
      <Results />
    </Branch>
  </Step>
</Project>
Basically, I would search for inurl:sports/soccer, then for inurl:cnn.com/sports, and I'd get different results: none on the first try and only one on the second, despite both queries returning full pages of links. I concluded it was an HTML tag issue because I managed to get alphanumeric characters scraped, at least partially, as long as I changed the tag from text to name or id; with text it was only scraping alphabetic characters. But no matter what tag I put in, it wouldn't scrape inurl: queries without .com, which was really annoying and led me to think there was an HTML tag covering non-alphanumeric characters that I didn't know about. Then I saw that inurl:cnn.com/sports was being scraped relatively OK, which brought me back to thinking it was a regexp issue. Overall I'm just terribly confused by it, so I think I'll move on to other templates instead.
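On the regexp side, one thing worth checking: the pattern in the "parse all url" step only keeps a link when "><EM> follows the href immediately, that is, when the result title starts with a highlighted word. That alone could explain getting zero or one link from a full page. A minimal Python sketch of that behaviour, using made-up markup (the onclick handler, the URLs, and the RELAXED_PATTERN alternative are all invented for illustration, and Google's actual HTML may differ):

```python
import re

# Pattern copied from the CMD_DOM_HTML step of the template.
LINK_PATTERN = re.compile(r"""(?<=\'\)\" href\=\").*?(?=\"\>\<EM\>)""")

# Hypothetical alternative: stop at the closing quote of the href instead
# of requiring an <EM> tag right after it.
RELAXED_PATTERN = re.compile(r"""(?<=\'\)\" href\=\")[^\"]*""")

# Invented result markup: the first title starts with a highlighted word,
# the second one does not.
SAMPLE_HTML = (
    """<a onclick="track('a')" href="http://cnn.com/sports"><EM>CNN</EM> Sports</a>"""
    """<a onclick="track('b')" href="http://example.com/sports/soccer">Latest <EM>soccer</EM> news</a>"""
)


def extract_links(html, pattern=LINK_PATTERN):
    """Return every substring of html matched by pattern."""
    return pattern.findall(html)


print(extract_links(SAMPLE_HTML))                   # only the first link
print(extract_links(SAMPLE_HTML, RELAXED_PATTERN))  # both links
```

If that is what is happening, the input tag and the special characters in the query would not be the cause; the lookahead in the regexp would be.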