tswhite70, would you check my spider?

Postby freebits on Sun Apr 02, 2006 4:03 pm

Hello,

I was able to create my own spiders, which extract music cover art from a site in my own language.

I can get the small cover art, but the big one is not easy for me.

Would you check my spider and advise me on what I should do to retrieve the bigger picture from my site?

My URL is http://www.yes24.com; it's a Korean website.

The following is my spider:

url=http://www.yes24.com/searchCenter/searchResult.aspx?keywordAd=&qdomain=%C0%BD%B9%DD&query=%searchstring%

results=<a href='(?<url>/Goods/FTGoodsView.aspx\?.*?)' .*?><b>(?<display>.*?)</a>

src='(?<coverart>http://image.yes24.com/momo/.*?.jpg)'
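
For anyone who wants to try the search URL outside Xlobby, here is a rough Python sketch of how the template expands. It assumes the site expects the query in EUC-KR, since the fixed qdomain value looks like EUC-KR percent-encoded Korean text; that is a guess, not something confirmed here. `build_search_url` is a hypothetical helper name, and Xlobby presumably performs the %searchstring% substitution on its own.

```python
from urllib.parse import quote, unquote

# Search URL template copied from the spider above.
SEARCH_TEMPLATE = (
    "http://www.yes24.com/searchCenter/searchResult.aspx"
    "?keywordAd=&qdomain=%C0%BD%B9%DD&query=%searchstring%"
)


def build_search_url(search_string, encoding="euc-kr"):
    """Fill in %searchstring% with a percent-encoded query.

    Assumption: the site wants EUC-KR bytes, like the qdomain value.
    """
    encoded = quote(search_string, encoding=encoding)
    return SEARCH_TEMPLATE.replace("%searchstring%", encoded)


if __name__ == "__main__":
    # Show what the fixed qdomain parameter decodes to under EUC-KR.
    print(unquote("%C0%BD%B9%DD", encoding="euc-kr"))
    print(build_search_url("abc def"))
```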


Thanks in advance,
Freebits
freebits
 
Posts: 36
Joined: Tue May 25, 2004 12:50 pm
Location: Seoul

Postby tswhite70 on Mon Apr 03, 2006 4:24 pm

Freebits - I took a look at it, tough for me since I don't read Korean, but Babelfish helped me out some. :D

It looks like the image link you actually want is just to the right of the small image link on the details page in a javascript reference. Looking at the following details page:
http://www.yes24.com/Goods/FTGoodsView. ... 3001001010

The small image link you are getting in your spider is:
http://image.yes24.com/momo/TopCate49/M ... 802649.jpg

The large image link is included in the javascript reference to the right of the small link in the details source and looks like this:
onClick='javascript:ZoomWindow("./FTGoodsFileView.aspx?goodsNo=1969087&CategoryNumber=003001001010&ImgUrl=/TopCate49/MidCate01/4802648.jpg"); return

The ImgUrl portion of that reference is the URL of the large image. I looked at a couple of different albums, and it appears the number in the large image link is always one less than the number in the small link:

small img: 4802649.jpg
large img: 4802648.jpg
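
Outside Xlobby, that off-by-one pattern would be easy to apply in a small script. A minimal Python sketch, assuming the pattern really holds for every album (only a couple were checked) and that file names are always digits followed by ".jpg"; `large_image_name` is a made-up helper name:

```python
import re


def large_image_name(small_name):
    """Return the file name whose number is one less than small_name's."""
    match = re.fullmatch(r"(\d+)(\.jpg)", small_name)
    if match is None:
        raise ValueError(f"unexpected file name: {small_name!r}")
    digits, extension = match.groups()
    # Pad to the original width in case the number has leading zeros.
    return f"{int(digits) - 1:0{len(digits)}d}{extension}"


if __name__ == "__main__":
    print(large_image_name("4802649.jpg"))
```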

I don't think we can do anything with that though since the final number is different for different albums and we can't tell Xlobby to subtract 1 from the URL :( . I haven't tested this (I'm not at home today), but I think it may work:

(?<coverart>ImgUrl=.*?)".; return
replace=coverart:ImgUrl=:http://image.yes24.com/momo

So we pull the link portion from the javascript reference and then replace the "ImgUrl=" with "http://image.yes24.com/momo" (spelled "ImgUrl", matching the page source, in case the replace is case-sensitive). I'm not sure what Steven's code will do with the ":" in "http://image"; hopefully it won't assume it's another variable in the replace statement.
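
To show what the capture and replace are meant to produce, here is a rough Python equivalent run against the onClick snippet quoted above. It is only a sketch: Python writes named groups as (?P<name>...) rather than (?<name>...), the parenthesis is escaped instead of matched with ".", and `large_cover_url` is a hypothetical helper, not anything Xlobby provides.

```python
import re

# The javascript reference quoted from the details page source.
DETAILS_SNIPPET = (
    'onClick=\'javascript:ZoomWindow("./FTGoodsFileView.aspx?goodsNo=1969087'
    '&CategoryNumber=003001001010&ImgUrl=/TopCate49/MidCate01/4802648.jpg"); return'
)

IMAGE_BASE = "http://image.yes24.com/momo"


def large_cover_url(details_html):
    """Capture the ImgUrl path and prefix it with the image host."""
    match = re.search(r'ImgUrl=(?P<path>.*?)"\); return', details_html)
    if match is None:
        return None
    # Same effect as replacing "ImgUrl=" with IMAGE_BASE in the capture.
    return IMAGE_BASE + match.group("path")


if __name__ == "__main__":
    print(large_cover_url(DETAILS_SNIPPET))
```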

The other option, if the replace doesn't work, is to use the full javascript reference as a URL call and then get the link directly from the javascript page, something like:

onClick.'javascript:ZoomWindow."(?<url>.*?)".; return
<span id="LB_CENTER_IMG"><img src='(?<coverart>.*?)' onClick=
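
The two-step idea can be sketched in Python as well. Assumptions: the relative link is resolved against the details page address (its query string is left out here), and VIEWER_HTML is a made-up stand-in for the zoom page, since the real page source isn't shown in this thread. The function names are hypothetical, and nothing is actually downloaded.

```python
import re
from urllib.parse import urljoin

# Details page address without its query string (an assumption for the demo).
DETAILS_URL = "http://www.yes24.com/Goods/FTGoodsView.aspx"

# The javascript reference quoted from the details page source.
DETAILS_HTML = (
    'onClick=\'javascript:ZoomWindow("./FTGoodsFileView.aspx?goodsNo=1969087'
    '&CategoryNumber=003001001010&ImgUrl=/TopCate49/MidCate01/4802648.jpg"); return'
)

# Hypothetical zoom page markup, only shaped to fit the second pattern.
VIEWER_HTML = (
    '<span id="LB_CENTER_IMG">'
    "<img src='/hypothetical/large.jpg' onClick='self.close()'></span>"
)


def zoom_page_url(details_url, details_html):
    """Step 1: pull the ZoomWindow link and make it absolute."""
    match = re.search(
        r"""onClick.'javascript:ZoomWindow."(?P<url>.*?)".; return""",
        details_html,
    )
    return urljoin(details_url, match.group("url")) if match else None


def cover_from_viewer(viewer_html):
    """Step 2: read the image source from the zoom page."""
    match = re.search(
        r"""<span id="LB_CENTER_IMG"><img src='(?P<coverart>.*?)' onClick=""",
        viewer_html,
    )
    return match.group("coverart") if match else None


if __name__ == "__main__":
    print(zoom_page_url(DETAILS_URL, DETAILS_HTML))
    print(cover_from_viewer(VIEWER_HTML))
```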

EDIT: The second option worked, I've posted the spider in the "Spiders" post so others can find it:
http://www.xlobby.com/forum/viewtopic.php?p=27121#27121

Freebits - let me know if it works for you!

good luck,
tsw
tswhite70
 
Posts: 318
Joined: Tue Jan 06, 2004 3:44 pm
Location: Houston, Tx

Great, great, great!!!

Postby freebits on Tue Apr 04, 2006 1:18 am

Hello tsw,

What great work!!! Thanks a lot!!! :D :D

The hardest part for me was also how to replace the number in the image file name. You avoided this problem and did a great job!!! Thank you.

I'm making a Korean version of the Xlobby package with language localization, local spiders, Ant Movie Catalog with Korean scripts, etc. Your help boosts my work a lot!!

Regards,
Freebits