Spider, writing, help

Postby lar282 on Tue Jan 16, 2007 7:50 am

dgemily, I know you are a specialist, but I would gladly take help from anybody.

I'm trying to modify my spider that uses IMDb so that it handles those movies that return only one result, for example Saw III.
If you go to
http://us.imdb.com/find?s=all&q=Saw III;tt=1
you should land on the movie page directly, not on a search result. So here's the code. What happens is that XL presents Saw III, but nothing happens when I click the result in the spider setup (using F2 database-spider) for testing.

please help me out


//------------------------------------------------------------
url=http://us.imdb.com/find?s=all&q=%searchstring%;tt=1

results=<title>(?<display>.*?)\(.*?\)</title>

<meta name="title" content="(?<title>.*?)\(.*?\)"><meta

//Plot
<b class="ch">Plot Outline:</b> (?<plot>.*?)?. <a href="

// Coverart
<a name="poster" .*? title=".*?" src="(?<coverart>.*?)"

<a href="/Sections/Genres/.*?">(?<genre>.*?)</a>

<a href="/mpaa">MPAA</a>:</b>(?<rating>.*?)<br>

<b class="ch">Runtime:</b>(?<runtime>.*?)<br>

//Actors
// <table cellpadding="1" cellspacing="0"><tr><td colspan="4" align="left"><b class="blackcatheader">Cast overview, first billed only:
first billed only: </b></td></tr> (?<variable><tr>.*?</tr><tr>.*?</tr><tr>.*?</tr><tr>.*?</tr><tr>.*?</tr>)
<td valign="top"><a href="/name/.*?">(?<actors>.*?)</a></td>


//------------------------------------------------------------
lar282
 
Posts: 1624
Joined: Thu Apr 01, 2004 4:13 pm
Location: Helsingborg, Sweden

Postby dgemily on Tue Jan 16, 2007 9:39 am

Not sure if it's possible...

First: when you go to http://us.imdb.com/find?s=all&q=Saw III;tt=1
you are in fact redirected to http://us.imdb.com/title/tt0489270/
so I'm not sure the spider search goes there...

Second point: your "results" line doesn't contain any URL, and XLobby is looking for one.

Last point: this will not work if there is more than one result for a search (because you are not redirected to the movie page when there is more than one result).

IMDb is not easy to handle with spiders...

later
dgemily
 
Posts: 793
Joined: Thu May 13, 2004 6:24 am
Location: Paris, France

Postby lar282 on Tue Jan 16, 2007 10:33 am

Well, I have a second IMDb spider that takes care of the rest, and that works great.

What if I just use the code below? Can you explain WHY that doesn't work? (Just as a test, and for me to understand.)
//------------------------------------------------------------
url=http://us.imdb.com/title/tt0489270/

results=http://us.imdb.com/title/tt0489270/
<title>(?<display>.*?)\(.*?\)</title>

<meta name="title" content="(?<title>.*?)\(.*?\)"><meta

//Plot
<b class="ch">Plot Outline:</b> (?<plot>.*?)?. <a href="

// Coverart
<a name="poster" .*? title=".*?" src="(?<coverart>.*?)"

<a href="/Sections/Genres/.*?">(?<genre>.*?)</a>

<a href="/mpaa">MPAA</a>:</b>(?<rating>.*?)<br>

<b class="ch">Runtime:</b>(?<runtime>.*?)<br>

//Actors
// <table cellpadding="1" cellspacing="0"><tr><td colspan="4" align="left"><b class="blackcatheader">Cast overview, first billed only:
first billed only: </b></td></tr> (?<variable><tr>.*?</tr><tr>.*?</tr><tr>.*?</tr><tr>.*?</tr><tr>.*?</tr>)
<td valign="top"><a href="/name/.*?">(?<actors>.*?)</a></td>


//------------------------------------------------------------

Postby dgemily on Tue Jan 16, 2007 2:30 pm

I need some free time to test it before I can tell you...
I think your problem is in the "results" line...
I just moved to another city and I don't have an internet connection at home at the moment...

Postby lar282 on Tue Jan 16, 2007 2:32 pm

I appreciate all the time/help you can give me.

Maybe, just maybe, Steven will check in soon... then he can help me.

//Lasse

Postby tswhite70 on Tue Jan 16, 2007 4:57 pm

I may be able to help. The current spider architecture requires the "results=" line with the (?<url>) and (?<display>) regexes which is why your two examples do not work - a "result" has to be provided for Xlobby to process the spider.
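
A rough Python sketch of that requirement, assuming the engine reads each match's "url" and "display" groups by name (how XLobby really does it internally is not known here). Python spells named groups (?P<name>...) where the spider files use (?<name>...). The sample HTML and the function names parse_results and parse_without_url are made up for illustration.

```python
import re

# Python spelling of a "results=" pattern that captures both named groups;
# spider files write (?<url>...) where Python needs (?P<url>...).
RESULTS = re.compile(r'<a href="(?P<url>/title/.*?/).*?">(?P<display>.*?)</a>')

# Hypothetical one-line excerpt of a search-results page, for illustration only.
SAMPLE = '<a href="/title/tt0489270/">Saw III</a>'


def parse_results(html):
    """Return a list of (url, display) pairs, one per match."""
    return [(m.group("url"), m.group("display")) for m in RESULTS.finditer(html)]


def parse_without_url(html):
    """A pattern with only a display group leaves no url to follow."""
    pattern = re.compile(r"<title>(?P<display>.*?)\(.*?\)</title>")
    return [m.groupdict().get("url") for m in pattern.finditer(html)]
```

With the first pattern every match carries a URL to follow; with a title-only pattern like the one in the first two attempts, the match exists but its url is missing.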

Your idea of writing a separate spider for IMDB for the specific instance where only a single result is returned by the search (automatically redirecting to that movie detail page) definitely has merit. I never considered doing this before, it's a great idea.

I'm not able to test right now but I think the following may work - the only thing changed from the first example you gave is the "results=" code:

Code:
url=http://us.imdb.com/find?s=all&q=%searchstring%;tt=1

results=<strong class="title">(?<display>.*?)\s?<small>.*?<td><a href=".*pro.imdb.com(?<url>.*?)">

<meta name="title" content="(?<title>.*?)\(.*?\)"><meta

//Plot
<b class="ch">Plot Outline:</b> (?<plot>.*?)?. <a href="

// Coverart
<a name="poster" .*? title=".*?" src="(?<coverart>.*?)"

<a href="/Sections/Genres/.*?">(?<genre>.*?)</a>

<a href="/mpaa">MPAA</a>:</b>(?<rating>.*?)<br>

<b class="ch">Runtime:</b>(?<runtime>.*?)<br>

//Actors
// <table cellpadding="1" cellspacing="0"><tr><td colspan="4" align="left"><b class="blackcatheader">Cast overview, first billed only:
first billed only: </b></td></tr> (?<variable><tr>.*?</tr><tr>.*?</tr><tr>.*?</tr><tr>.*?</tr><tr>.*?</tr>)
<td valign="top"><a href="/name/.*?">(?<actors>.*?)</a></td>


Note that this is a single-instance spider: it will not work for movie searches that return multiple results, so you will still need the original IMDB spider for those.

Oh yeah, if you copy the code above make sure you remove any trailing spaces from each line before you save the spider.

good luck,
tsw
tswhite70
 
Posts: 318
Joined: Tue Jan 06, 2004 3:44 pm
Location: Houston, Tx

Postby dgemily on Tue Jan 16, 2007 6:08 pm

yeah, the real specialist is here ;)

thx tswhite70

Postby lar282 on Tue Jan 16, 2007 6:16 pm

Thank you for the help.

It now works if I do the search from F2, but not in the movie screen.

Really weird, but I guess I am one step closer. The problem is that I have no clue how to fix it.


Do you have a few more minutes to spend on this?


//Lasse

Postby tswhite70 on Tue Jan 16, 2007 7:13 pm

I can't think of any reason why it would work from F2 but not from your Movie screen. Are you sure you modified your Movie screen spider button events to point to the right spider? I.e., did you name the new single-instance spider something like "DVD - IMDB.com Single.txt" while the movie screen is still pointed at the original "DVD - IMDB.com.txt" spider? If that's not the problem then I'll have to play around with it a little tonight when I get home.

Or is it maybe that you can't see the result because there is only one? I wouldn't think that would be a problem, but I've never searched for just movie info via a screen-based spider, always coverart. Try the following to add coverart to the spider and see if that makes a difference:
Code:
<a name="poster" href=".*?" title=.*><img border="0" alt=.*? title=.*? src="(?<coverart>http://ia.ec.imdb.com/.*?)".*?></a>
replace=coverart:m.jpg:f.jpg


dgemily - you give me too much credit, I'm just following your great example...

tsw

Postby lar282 on Wed Jan 17, 2007 7:22 am

That didn't help either. I already had coverart.

My old code:
<a name="poster" .*? title=".*?" src="(?<coverart>.*?)"


The name of the txt file is correct. My regular one is named

dvd - imdb.com.Regular.txt
and my direct one is named
dvd - imdb.com.Direct.txt


I think you are on to something about there being just one hit. Man, I wish Steven was here!


So the code for the regular one looks like this, and it works.
//----------------------------------------------------------------
url=http://us.imdb.com/find?s=all&q=%searchstring%;tt=1
results=<a href="(?<url>/title/.*?/).*?">(?<display>.*?)</a>

// Now goto that new url that stores all info on imdb
lasse=http://www.imdb.com%url%

<title>(?<display>.*?)\((?<year>.*?)\)</title>
//<a name="poster" .*? title="(?<title>.*?)">

//Plot
<b class="ch">Plot Outline:</b> (?<plot>.*?)?. <a href="

// Coverart
<a name="poster" .*? title=".*?" src="(?<coverart>.*?)"

<a href="/Sections/Genres/.*?">(?<genre>.*?)</a>

<a href="/mpaa">MPAA</a>:</b>(?<rating>.*?)<br>

<b class="ch">Runtime:</b>(?<runtime>.*?)<br>

//Actors
// <table cellpadding="1" cellspacing="0"><tr><td colspan="4" align="left"><b class="blackcatheader">Cast overview, first billed only:
first billed only: </b></td></tr> (?<variable><tr>.*?</tr><tr>.*?</tr><tr>.*?</tr><tr>.*?</tr><tr>.*?</tr>)
<td valign="top"><a href="/name/.*?">(?<actors>.*?)</a></td>

//----------------------------------------------------------------

The direct one looks like this and only works through F2.
//----------------------------------------------------------------
url=http://us.imdb.com/find?s=all&q=%searchstring%;tt=1

results=<strong class="title">(?<display>.*?)\s?<small>.*?<td><a href=".*pro.imdb.com(?<url>.*?)">
<a name="poster" href=".*?" title=.*><img border="0" alt=.*? title=.*? src="(?<coverart>http://ia.ec.imdb.com/.*?)".*?></a>
replace=coverart:m.jpg:f.jpg
<meta name="title" content="(?<title>.*?)\(.*?\)"><meta

//Plot
<b class="ch">Plot Outline:</b> (?<plot>.*?)?. <a href="

// Coverart
//<a name="poster" .*? title=".*?" src="(?<coverart>.*?)"

<a href="/Sections/Genres/.*?">(?<genre>.*?)</a>

<a href="/mpaa">MPAA</a>:</b>(?<rating>.*?)<br>

<b class="ch">Runtime:</b>(?<runtime>.*?)<br>

//Actors
// <table cellpadding="1" cellspacing="0"><tr><td colspan="4" align="left"><b class="blackcatheader">Cast overview, first billed only:
first billed only: </b></td></tr> (?<variable><tr>.*?</tr><tr>.*?</tr><tr>.*?</tr><tr>.*?</tr><tr>.*?</tr>)
<td valign="top"><a href="/name/.*?">(?<actors>.*?)</a></td>

//----------------------------------------------------------------
//Lasse

Postby tswhite70 on Wed Jan 17, 2007 3:20 pm

Yeah, I worked on it last night for a little while. I think Steven is the only one who will be able to answer this. I tried several things, including making sure the regexes for "results=" were in the more traditional order of (?<url>) (?<display>) (just in case Steven was using separate code that used match position instead of name), without any luck. Obviously the on-screen spider code is a little different from F2 (it pulls the results and processes each match immediately, instead of processing the match when you click on the relevant result). Maybe the on-screen code is dropping the first match, which is very easy to do by fat-fingering a loop as "for i=1 to X" instead of "for i=0 to X", or with a similar slip in an "if" statement... you know what I mean.
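
To make that guess concrete, here is a tiny Python sketch of the off-by-one slip. It is pure speculation about XLobby's internals (nobody in this thread has seen the source); the function names are invented for illustration.

```python
def process_from_zero(matches):
    """Visit every match, starting at index 0."""
    return [matches[i] for i in range(0, len(matches))]


def process_from_one(matches):
    """Buggy variant: starting at index 1 silently skips the first match."""
    return [matches[i] for i in range(1, len(matches))]
```

With a long result list the bug only costs one entry, which is easy to miss. With a single-result search it leaves nothing at all, which would look exactly like "nothing happens" on the movie screen.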

For coverart, that's the same regex, yours is just a little prettier ;) Make sure you do the "replace" if you want the higher-res coverart: low res always seems to be "**m.jpg" and high res is "**f.jpg".
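
A small Python sketch of what the "replace=coverart:m.jpg:f.jpg" line appears to do, assuming the format is field:old:new and that it is a plain text substitution on the captured value (that reading is a guess from the line itself, not from any XLobby documentation). The function names and the sample URL are hypothetical.

```python
def parse_replace(line):
    """Split 'replace=field:old:new' into its three parts (assumed format)."""
    key, _, spec = line.partition("=")
    if key != "replace":
        raise ValueError("not a replace line: " + line)
    field, old, new = spec.split(":", 2)
    return field, old, new


def apply_replace(fields, line):
    """Return a copy of fields with the substitution applied to one field."""
    field, old, new = parse_replace(line)
    updated = dict(fields)
    if field in updated:
        updated[field] = updated[field].replace(old, new)
    return updated
```

For example, a captured coverart value ending in "m.jpg" would come out ending in "f.jpg", and every other field is left alone.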

good luck,
tsw

Postby dgemily on Wed Jan 17, 2007 4:39 pm

tswhite70 wrote: Obviously the on screen spider code is a little different from F2 (it pulls the results and processes each match immediately, instead of processing the match when you click on the relevant result). Maybe the onscreen code is dropping the first match, which is very easy to do with a fatfinger on the loop "for i=1 to X" instead of "for i=0 to X" or "if" statements..... you know what I mean.


Lasse, maybe you can try putting these two lines at the top of your spider:

Code:
downloadresultsonly
limit=0


later

Postby lar282 on Wed Jan 17, 2007 7:25 pm

No difference.
What do these options do?
And where is Steven?

//Lasse

Postby dgemily on Wed Jan 17, 2007 8:58 pm

Those are parameters that I asked for ;)

"downloadresultsonly" is a parameter to download only the result list (without the details; when you then select a result, it downloads the information for that item).
It is a parameter to get around the problem that tswhite70 was talking about:
tswhite70 wrote: Obviously the on screen spider code is a little different from F2 (it pulls the results and processes each match immediately, instead of processing the match when you click on the relevant result).
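
Roughly, the difference between the two behaviours looks like this in Python. This is only a sketch of the idea as described in this post, not XLobby's actual code; run_spider, select and fetch_details are invented names.

```python
def run_spider(results, fetch_details, download_results_only=False):
    """Return (listing, details), where details maps url -> fetched info.

    Without the flag, details are fetched for every result right away.
    With the flag, only the listing is built and details stay empty.
    """
    details = {}
    if not download_results_only:
        for url in results:
            details[url] = fetch_details(url)
    return list(results), details


def select(url, details, fetch_details):
    """Fetch a result's details on demand if they are not cached yet."""
    if url not in details:
        details[url] = fetch_details(url)
    return details[url]
```

With the flag set, nothing is fetched until a result is actually picked, which is closer to how the F2 test behaves.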


"limit="
is a parameter to limit the number of items in the result list. With limit=0 there is no limit.

If you don't use this parameter, you are limited to 6 result items per spider used.
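
As a Python sketch, the rule described above would be something like the following. The default of 6 and the meaning of 0 come from this post; the function name apply_limit and the handling of other values are assumptions.

```python
DEFAULT_LIMIT = 6  # per the description above: six items when limit= is absent


def apply_limit(results, limit=None):
    """Trim a result list the way the limit= parameter is described."""
    if limit is None:
        # No limit= line in the spider: fall back to the default cap.
        return results[:DEFAULT_LIMIT]
    if limit == 0:
        # limit=0 means no limit at all.
        return list(results)
    return results[:limit]
```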

As for Steven, I can't help you...

later

---------------------------edit-----------------------------
tsw or Lasse, did you ever try my spiders for online meal recipes? (When they were working... before the last bug.)