How to write custom spider for site: www.indiancdstore.com

Postby tcj2001 on Fri Sep 23, 2005 3:25 pm

Does anybody have any info on writing a custom spider for http://www.indiancdstore.com?
tcj2001
 
Posts: 19
Joined: Wed Sep 21, 2005 3:08 pm

Postby hvs69 on Fri Sep 23, 2005 7:32 pm

I have tried writing a spider for nehaflix.com, but I am only partially successful. I can get the matches to show up, but I still cannot extract information such as actors, director, cover art, etc. Part of the problem is that expressions are hard to debug, since Xlobby is not the best tool for debugging.
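
One way around the debugging problem is to try each expression outside Xlobby first. The sketch below is only an illustration in Python, not anything Xlobby provides: `to_python_regex` and `try_pattern` are hypothetical helper names, the sample HTML is made up, and the DOTALL/IGNORECASE flags are a guess at sensible settings rather than what Xlobby actually uses. Python writes named groups as `(?P<name>...)`, so the helper converts the `(?<name>...)` form used in spider.txt.

```python
import re

def to_python_regex(spider_pattern):
    """Convert (?<name>...) named groups to Python's (?P<name>...) form."""
    # Lookbehinds such as (?<= and (?<! do not match this and stay untouched.
    return re.sub(r"\(\?<([A-Za-z_]\w*)>", r"(?P<\1>", spider_pattern)

def try_pattern(spider_pattern, html):
    """Return one dict per match, keyed by group name."""
    # Flags are an assumption of this sketch, not documented Xlobby behaviour.
    regex = re.compile(to_python_regex(spider_pattern), re.DOTALL | re.IGNORECASE)
    return [m.groupdict() for m in regex.finditer(html)]

# Made-up fragment shaped like one search-results row.
sample = '<A HREF="http://example.com/item1.html"><b>Some Movie</b></A>'
pattern = r'<A HREF="(?<url>http://example\.com/.*?)"><b>(?<display>.*?)</b>'

matches = try_pattern(pattern, sample)
for match in matches:
    print(match)
```

Saving the real search page to a file and passing its contents as `html` shows straight away whether a group is empty or captures too much.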

I have asked for some help in the past but did not get any reply. I am assuming that only Steven knows how to write spiders and he is probably too busy to write a tutorial on this topic.

If you are good with regular expressions, maybe we can work on it together and sort it out.
hvs69
 
Posts: 219
Joined: Wed Feb 11, 2004 8:06 am

Postby tcj2001 on Fri Sep 23, 2005 8:35 pm

Could you post your spider txt for nehaflix.com? Maybe I can learn something from it.
tcj2001
 
Posts: 19
Joined: Wed Sep 21, 2005 3:08 pm

Postby hvs69 on Mon Sep 26, 2005 6:28 pm

tcj2001 wrote: Could you post your spider txt for nehaflix.com? Maybe I can learn something from it.


Here is the code I have for Nehaflix.com

Code:
url=http://search.store.yahoo.com/cgi-bin/nsearch?catalog=nehaflix&query=%searchstring%

results=<TD COLSPAN=2><FONT SIZE=3><A HREF="(?<url>http://store.yahoo.com/nehaflix/.*?)"><b>(?<display>.*?)</b>

//find actors
nsearc\?cat.*?>(?<actors>.*?)</a>



It finds the matching movie names, but does not fetch the <actors> field. Take a look at it and study the amazon spider. Let me know if you can debug it.

Steve, if you are reading this thread, please help us in debugging.

Thanks.
hvs69
 
Posts: 219
Joined: Wed Feb 11, 2004 8:06 am

Postby hvs69 on Mon Sep 26, 2005 7:30 pm

Well, I tried indiancdstore.com as well. I am getting the same results.

Here is the code:

Code:
url=http://www.indiancdstore.com/ssearch.asp?keywords=%searchstring%&x=0&y=0

results=<div align='center'><center><table border='0' width='95%' cellspacing='0' cellpadding='0'><tr><td width='100%' colspan='2' bgcolor='#1E4CA9' ><strong><small><font color='#FFFFFF' face='Arial' .*? <a href='(?<url>viewdetails.asp.*?)'>(?<display>.*?)</a>


//find actors
CAST.*?000.*?> (?<actors>.*?) </font>



I think I need some explanation of how the ?<url> tag is used.

Is my current <url> expression correct?
Which URL is the spider looking at when it searches for <actors>?
hvs69
 
Posts: 219
Joined: Wed Feb 11, 2004 8:06 am

Postby tcj2001 on Tue Sep 27, 2005 2:58 pm

I am actually not able to find a pattern in the existing spider.txt files. I think we need somebody's help to code this spider.
tcj2001
 
Posts: 19
Joined: Wed Sep 21, 2005 3:08 pm

Success

Postby tcj2001 on Mon Oct 03, 2005 9:35 pm

Finally got it working (spider for http://www.nehaflix.com, Hindi Music):

url=http://search.store.yahoo.com/cgi-bin/nsearch?catalog=nehaflix&query=%searchstring%
results=<TD COLSPAN=2><FONT SIZE=3><A HREF="(?<url>http://store.yahoo.com/nehaflix/.*?)"><b>(?<display>.*?)</b></A></TD>

href="(?<coverart>http://store1.yimg.com/I/nehaflix_.*?)"
tcj2001
 
Posts: 19
Joined: Wed Sep 21, 2005 3:08 pm

Re: Success

Postby hvs69 on Tue Oct 04, 2005 4:52 pm

kewl. I will look into it. Now that you are pretty good at it, can you please do it for DVDs with actor/director/plot fields? :D

Many thanks for posting.

tcj2001 wrote: Finally got it working (spider for http://www.nehaflix.com, Hindi Music):

url=http://search.store.yahoo.com/cgi-bin/nsearch?catalog=nehaflix&query=%searchstring%
results=<TD COLSPAN=2><FONT SIZE=3><A HREF="(?<url>http://store.yahoo.com/nehaflix/.*?)"><b>(?<display>.*?)</b></A></TD>

href="(?<coverart>http://store1.yimg.com/I/nehaflix_.*?)"
hvs69
 
Posts: 219
Joined: Wed Feb 11, 2004 8:06 am

Postby hvs69 on Tue Oct 04, 2005 8:08 pm

tcj,

Taking a cue from your posted spider, I have managed to modify your code to include the <plot>, <genre>, and <director> fields. I cannot get <actors> to work yet because of the cross-referencing http links on each actor. Let me know if you can make it work. Here is my code:

Code:
url=http://search.store.yahoo.com/cgi-bin/nsearch?catalog=nehaflix&query=%searchstring%
results=<TD COLSPAN=2><FONT SIZE=3><A HREF="(?<url>http://store.yahoo.com/nehaflix/.*?)"><b>(?<display>.*?)</b>

href="(?<coverart>http://store1.yimg.com/I/nehaflix_.*?)"

Director:.*?>(?<director>.*?)<

<b>Genre:.*?>(?<genre>.*?)<p>

<p>Synopsis.*?>(?<plot>.*?)<p>
hvs69
 
Posts: 219
Joined: Wed Feb 11, 2004 8:06 am

Postby tcj2001 on Tue Oct 04, 2005 10:05 pm

Yes, getting the actors is difficult, as there are hyperlinks on each of them.

I could get the actors into separate fields (up to 5) using this code:

<b>Starring:</b> <a href.*?">(?<actor1>.*?)</a>.*?">(?<actor2>.*?)</a>.*?">(?<actor3>.*?)</a>.*?">(?<actor4>.*?)</a>.*?">(?<actor5>.*?)</a>.*?

But the problem is that if there are fewer than 5 actors, you get garbage in the remaining actor fields. You could reduce it to only 2 actors, since every film will have at least 2 actors.

Or we have to get everything, including the hyperlinks, using this code:
<b>Starring:</b>(?<actors>.*?)</a><p><b>Subtitles:</b>
and then use some REPLACE operation to delete the hyperlink data in the field. I don't know how to do this.
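
The capture-then-clean idea can at least be tried outside Xlobby. Here is a minimal sketch in Python, for illustration only: it does not show how (or whether) a spider.txt can do the same thing, the page fragment is made up, and `strip_tags` is a hypothetical helper name.

```python
import re

def strip_tags(fragment):
    """Remove every HTML tag, keeping only the text between tags."""
    text = re.sub(r"<[^>]+>", "", fragment)
    # Collapse the whitespace left behind by the removed tags.
    return re.sub(r"\s+", " ", text).strip()

# Made-up page fragment with two linked actor names.
html = ('<b>Starring:</b> <a href="a1.html">Actor One</a>, '
        '<a href="a2.html">Actor Two</a><p><b>Subtitles:</b> English')

# Same expression as above, with (?<actors>...) written as (?P<actors>...).
block = re.search(r"<b>Starring:</b>(?P<actors>.*?)</a><p><b>Subtitles:</b>",
                  html, re.DOTALL)
actors = strip_tags(block.group("actors"))
print(actors)
```

Unlike the five-field expression, this does not care how many actors a film lists.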

I use
<p>Synopsis by Nehaflix.com.*?<p>(?<plot>.*?)<p><i><b>Code:
to get plot data
tcj2001
 
Posts: 19
Joined: Wed Sep 21, 2005 3:08 pm

Postby hvs69 on Tue Oct 04, 2005 10:26 pm

Thanks.

Did you get the indiancdstore spider working yet? I am still having the same problems with it.
hvs69
 
Posts: 219
Joined: Wed Feb 11, 2004 8:06 am

To find Actors in www.nehaflix.com

Postby tcj2001 on Wed Oct 05, 2005 10:28 pm

Use these expressions for http://www.nehaflix.com to find all actors:

<b>Starring:</b>(?<variable>.*?)</a><p><b>Subtitles:</b>
<a href=".*?">(?<actors>.*?)</a>

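You should understand the concept of capturing data into a variable and then using a regex to search inside that variable: the first line grabs the whole Starring block into <variable>, and the second line is applied only to that captured text.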
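
The two-step idea can be sketched as follows. This is Python used purely for illustration, with a made-up page fragment; it mimics the concept, not Xlobby's actual implementation. One deliberate difference from the spider lines: the closing `</a>` is kept inside the captured block so that the last name still has its end tag. How Xlobby itself treats that boundary is not verified here.

```python
import re

# Made-up fragment; the last link sits outside the Starring block on purpose.
html = ('<b>Starring:</b> <a href="a1.html">Actor One</a>, '
        '<a href="a2.html">Actor Two</a><p><b>Subtitles:</b> '
        '<a href="sub.html">English</a>')

# Step 1: capture the whole Starring block into a named group.
# Assumption of this sketch: the final </a> is included in the group.
block = re.search(r"<b>Starring:</b>(?P<variable>.*?</a>)<p><b>Subtitles:</b>",
                  html, re.DOTALL)

# Step 2: search only inside the captured text, once per link.
actors = re.findall(r'<a href=".*?">(.*?)</a>', block.group("variable"))
print(actors)
```

Because step 2 never sees the rest of the page, the Subtitles link is not picked up as an actor.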


Complete Nehaflix spider.txt
url=http://search.store.yahoo.com/cgi-bin/nsearch?catalog=nehaflix&query=%searchstring%
results=<TD COLSPAN=2><FONT SIZE=3><A HREF="(?<url>http://store.yahoo.com/nehaflix/.*?)"><b>(?<display>.*?)</b></A></TD>

<b>Starring:</b>(?<variable>.*?)</a><p><b>Subtitles:</b>
<a href=".*?">(?<actors>.*?)</a>

<p>Synopsis by Nehaflix.com.*?<p>(?<plot>.*?)<p><i><b>Code:
href="(?<coverart>http://store1.yimg.com/I/nehaflix_.*?)"
tcj2001
 
Posts: 19
Joined: Wed Sep 21, 2005 3:08 pm

Spider for IndianCdStore

Postby tcj2001 on Wed Oct 05, 2005 10:31 pm

url=http://www.indiancdstore.com/ssearch.asp?keywords=%searchstring%&x=4&y=6
results=<a href='(?<url>viewdetails.asp.*?)'class='l'>(?<display>.*?)</a>

CAST<br>(?<variable>.*?)</tr>
color="#000000"> (?<Cast>.*?) </font>

<img src="(?<coverart>images/.*?)" width="100" height="140" border="1"></font></td>

replace=coverart:images3:images3
replace=coverart:images2:images3,images2
replace=coverart:images:images3,images2,images


Here I have shown how to get the cast information; you can use the same concept to get the other fields.

Please work on it to get the other information and post the final spider; I am too lazy to do the rest.

It took 1 full day to debug and build these spiders... Enjoy
tcj2001
 
Posts: 19
Joined: Wed Sep 21, 2005 3:08 pm