Tap tap tap.. let's dig through the Naver Directory.
First, let's extract the category list.
Subcategory scan:
using System;
using System.Net;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        // Download the category page and pull out every plain <a href="...">name</a> anchor.
        using (WebClient browser = new WebClient())
        {
            string commonPrefix = "/News";
            string text = browser.DownloadString("http://dir.naver.com" + commonPrefix);
            Regex regex = new Regex("<a href=\"(.*?)\">(.*?)</a>");
            foreach (Match match in regex.Matches(text))
            {
                string link = match.Groups[1].Value;
                string name = match.Groups[2].Value;
                // Keep only links below the current category, skipping the category itself.
                if (!link.StartsWith(commonPrefix, StringComparison.Ordinal) || link.Equals(commonPrefix))
                    continue;
                Console.WriteLine(name + " -> " + link);
            }
        }
    }
}
Extracted categories:
뉴스 속보 -> /News/Current_News
사설, 칼럼 -> /News/Columns
일기예보, 날씨 -> /News/Weather
TV방송 -> /News/Broadcasting_Station
디지털 멀티미디어 방송 -> /News/DMB
라디오방송 -> /News/Radio
위성방송 -> /News/Satellite
인터넷 프로토콜 텔레비전 -> /News/iptv
인터넷방송 -> /News/Internet_Broadcasting
케이블TV -> /News/Cable_TV
신문 -> /News/Newspapers
잡지, 웹진 -> /News/Magazines
외국미디어 -> /News/World
통신사 -> /News/News_Agency
방송장비 -> /News/News_and_Media
저널리즘 -> /News/Journalism
협회, 단체 -> /News/Organizations
The URL list can be pulled out like this.
URL scanner:
// Prints every external link listed on the given directory page.
// The output below was produced for "/News/Weather/Today".
private void GetUrls(string directory)
{
    using (WebClient browser = new WebClient())
    {
        string text = browser.DownloadString("http://dir.naver.com" + directory);
        // External sites are the anchors that open in a new window (target=_blank).
        Regex regex = new Regex("<a href=\"(.*?)\".*?target=_blank.*?>(.*?)</a>");
        foreach (Match match in regex.Matches(text))
        {
            string link = match.Groups[1].Value;
            string name = match.Groups[2].Value;
            // Skip Naver's own services.
            if (name.Contains(".naver.com") || link.Contains(".naver.com"))
                continue;
            Console.WriteLine(name + " -> " + link);
        }
    }
}
Extracted URLs:
동네예보 -> http://www.digital.go.kr/
날씨닷컴 -> http://www.nalsee.com/
날씨 ON -> http://www.weather.kr/
케이웨더 630 -> http://www.630.co.kr/
케이웨더 630 -> http://www.630.co.kr/
W365닷컴 -> http://www.w365.com/
웨더뉴스 날씨 -> http://www.weathernews.co.kr/
Weather Wiz Kids -> http://www.weatherwizkids.com/
기상청 날씨정보 -> http://www.kma.go.kr/weather/main.jsp
네이버 날씨 -> http://weather.naver.com/
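Now, if I process this recursively, I can get every URL matched to its category. But seeing that even Naver makes the mistake of registering the same site twice, I should probably add a duplicate check as well.
With the return types of the existing methods changed slightly, it goes like this.
Recursive directory scanner: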
public void Run()
{
    IList<CategorizedHyperlink> links = new List<CategorizedHyperlink>();
    BuildRecursively("/News/Weather", links);
    foreach (CategorizedHyperlink link in links)
    {
        Console.WriteLine(link.Category + ": " + link.Hyperlink.Name + " -> " + link.Hyperlink.Link);
    }
    Console.WriteLine("Total: " + links.Count);
}

// GetCategories and GetUrls are the two scanners above, changed to return lists instead of printing.
private void BuildRecursively(string parent, IList<CategorizedHyperlink> links)
{
    // Descend into every subcategory first, so the deepest categories are collected first.
    IList<Hyperlink> categories = GetCategories(parent);
    foreach (Hyperlink link in categories)
    {
        BuildRecursively(link.Link, links);
    }
    // Then collect the sites registered directly under this category.
    IList<Hyperlink> urls = GetUrls(parent);
    foreach (Hyperlink link in urls)
    {
        links.Add(new CategorizedHyperlink { Category = parent, Hyperlink = link });
    }
}
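The scanner relies on two types, Hyperlink and CategorizedHyperlink, and on list-returning versions of the two earlier methods, none of which are shown in this post. As a rough sketch of what they might look like: the class shapes below are inferred only from how the scanner uses them, and DirectoryParser, ParseCategories and ParseUrls are hypothetical names introduced here. The parsing is kept separate from the download so it can be tried on a plain HTML string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical shapes, inferred from how the scanner reads and writes them.
public class Hyperlink
{
    public string Name { get; set; }
    public string Link { get; set; }
}

public class CategorizedHyperlink
{
    public string Category { get; set; }
    public Hyperlink Hyperlink { get; set; }
}

// Hypothetical helper: the same regex logic as the two scanners,
// but returning lists instead of writing to the console.
public static class DirectoryParser
{
    // Plain <a href="...">name</a> anchors, as in the subcategory scan.
    private static readonly Regex CategoryAnchor =
        new Regex("<a href=\"(.*?)\">(.*?)</a>");

    // Anchors that open in a new window, as in the URL scanner.
    private static readonly Regex ExternalAnchor =
        new Regex("<a href=\"(.*?)\".*?target=_blank.*?>(.*?)</a>");

    // Returns the links that live below `parent`, excluding `parent` itself.
    public static IList<Hyperlink> ParseCategories(string html, string parent)
    {
        var result = new List<Hyperlink>();
        foreach (Match match in CategoryAnchor.Matches(html))
        {
            string link = match.Groups[1].Value;
            string name = match.Groups[2].Value;
            if (!link.StartsWith(parent, StringComparison.Ordinal) || link.Equals(parent))
                continue;
            result.Add(new Hyperlink { Name = name, Link = link });
        }
        return result;
    }

    // Returns the external sites on the page, skipping Naver's own services.
    public static IList<Hyperlink> ParseUrls(string html)
    {
        var result = new List<Hyperlink>();
        foreach (Match match in ExternalAnchor.Matches(html))
        {
            string link = match.Groups[1].Value;
            string name = match.Groups[2].Value;
            if (name.Contains(".naver.com") || link.Contains(".naver.com"))
                continue;
            result.Add(new Hyperlink { Name = name, Link = link });
        }
        return result;
    }
}
```

Under these assumptions, GetCategories(parent) would download "http://dir.naver.com" + parent and pass the text to ParseCategories, and GetUrls(parent) would do the same with ParseUrls.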
Recursive directory scan results:
/News/Weather/Life/Mountain_and_Valley: 국립공원 산악날씨 -> http://www.w365.com/korea/kor/life/w365_mount01_1.html/
/News/Weather/Life/Mountain_and_Valley: 도립공원 산악날씨 -> http://www.w365.com/korea/kor/life/w365_mount01_2.html/
/News/Weather/Life/Sunrise_and_sunset: 월별 해달출몰시각 - 한국천문연구원 -> http://www.kasi.re.kr/Knowledge/sunmoon_map.aspx/
/News/Weather/Life/Sunrise_and_sunset: 일월출몰 - W365 -> http://www.w365.com/korea/kor/sunm/sun_monn.php/
/News/Weather/Life: 농업기상정보시스템 -> http://weather.rda.go.kr/
/News/Weather/Life: El Tiempo -> http://www.eltiempo.es/
/News/Weather/World/North_America: The Climate Registry -> http://www.theclimateregistry.org/
/News/Weather/World/Asia: WNI Weathernews -> http://www.weathernews.jp/
/News/Weather/World/Asia: 일본기상협회 -> http://www.tenki.or.jp/
/News/Weather/World/Asia: 중국기상정보넷 -> http://www.weathercn.com/
/News/Weather/World/Asia: WEATHER EYE -> http://www.weather-eye.com/
/News/Weather/World/Asia: 코우치 대학 기상 정보페이지 -> http://weather.is.kochi-u.ac.jp/
/News/Weather/World/Asia: 안휘성기상 -> http://www.ahqx.gov.cn/
/News/Weather/World/Asia: Life & Business Weather -> http://tenki.lbw.jp/
/News/Weather/World/Asia: 청해기상 -> http://www.qhqxj.gov.cn/
/News/Weather/World/Asia: 귀주성기상 -> http://www.121net.com.cn/
/News/Weather/World/Asia: 중정기상네트워크 -> http://www.121.cq.cn/
/News/Weather/World/Oceania: Weather Austrian -> http://www.weather.com.au/
/News/Weather/World/Oceania: WeatherZone -> http://www.weatherzone.com.au/
/News/Weather/World/Oceania: 야후 날씨 뉴질랜드 -> http://nz.weather.yahoo.com/
/News/Weather/World/Europe: Wetter de -> http://wetter.rtl.de/
/News/Weather/World/Europe: 유럽기상서비스네트워크 -> http://www.meteoalarm.eu/
/News/Weather/World: Wunderground -> http://www.wunderground.com/
/News/Weather/World: 세계기상 - 기상청 -> http://www.kma.go.kr/world/world_01.jsp
/News/Weather/World: World Climate -> http://www.worldclimate.com/
/News/Weather/World: 웨더버그 -> http://weather.weatherbug.com/
/News/Weather/World: WeatherBase -> http://www.weatherbase.com/
/News/Weather/World: Foreca -> http://www.foreca.com/
/News/Weather/World: Intellicast -> http://www.intellicast.com/
/News/Weather/World: Metcheck -> http://www.metcheck.com/
/News/Weather/World: Weather Reports -> http://www.weatherreports.com/
/News/Weather/World: 유니시스 날씨 -> http://www.weather.unisys.com/
/News/Weather/Today: 동네예보 -> http://www.digital.go.kr/
/News/Weather/Today: 날씨닷컴 -> http://www.nalsee.com/
/News/Weather/Today: 날씨 ON -> http://www.weather.kr/
/News/Weather/Today: 케이웨더 630 -> http://www.630.co.kr/
/News/Weather/Today: 케이웨더 630 -> http://www.630.co.kr/
/News/Weather/Today: W365닷컴 -> http://www.w365.com/
/News/Weather/Today: 웨더뉴스 날씨 -> http://www.weathernews.co.kr/
/News/Weather/Today: Weather Wiz Kids -> http://www.weatherwizkids.com/
/News/Weather/Today: 기상청 날씨정보 -> http://www.kma.go.kr/weather/main.jsp
/News/Weather/Forecast/Aviation: 항공기상청 -> http://kama.kma.go.kr/
/News/Weather/Forecast/Aviation: 공항예보 - 항공기상청 -> http://kama.kma.go.kr/kama/wsub02/internal_05.jsp
/News/Weather/Forecast: 동네예보 -> http://www.digital.go.kr/
/News/Weather/Forecast: 날씨닷컴 -> http://www.nalsee.com/
/News/Weather/Forecast: 날씨 ON -> http://www.weather.kr/
/News/Weather/Forecast: 케이웨더 630 -> http://www.630.co.kr/
/News/Weather/Forecast: W365닷컴 -> http://www.w365.com/
/News/Weather/Forecast: 웨더스타 -> http://www.weatherstar.co.kr/
/News/Weather/Forecast: 웨더뉴스 날씨 -> http://www.weathernews.co.kr/
/News/Weather/Forecast: WindGURU -> http://www.windguru.com/
/News/Weather/Forecast: 기상청 날씨정보 -> http://www.kma.go.kr/weather/main.jsp
/News/Weather/Regional/Gyeongsangbukdo: 고령군 방재기상정보 -> http://soback.kornet.net/~pvkbys/gr/
/News/Weather/Regional/Gyeongsangbukdo: 성주군 방재기상정보 -> http://soback.kornet.net/~pvkbys/sj/
/News/Weather/Regional/Gyeongsangbukdo: 경상북도 방재기상정보 -> http://soback.kornet.net/~pvkbys/do/
/News/Weather/Regional/Gyeongsangbukdo: 구미시 방재기상정보 -> http://soback.kornet.net/~pvkbys/gm/
/News/Weather/Regional/Gyeongsangbukdo: 군위군 방재기상정보 -> http://soback.kornet.net/~pvkbys/gw/
/News/Weather/Regional/Gyeongsangbukdo: 경주시 방재기상정보 -> http://soback.kornet.net/~pvkbys/kj/
/News/Weather/Regional/Gyeongsangbukdo: 봉화군 방재기상정보 -> http://soback.kornet.net/~pvkbys/bh/
/News/Weather/Regional/Gyeongsangbukdo: 안동시 방재기상정보 -> http://soback.kornet.net/~pvkbys/ad/
/News/Weather/Regional: 동네예보 -> http://www.digital.go.kr/
/News/Weather/Regional: 지역별 주간날씨 -> http://www.ehyundai.com/home/weather/customer/weather_week.jsp
/News/Weather/Regional: 호남악기상정보센터 -> http://hcis.kma.go.kr/
/News/Weather: 동네예보 -> http://www.digital.go.kr/
/News/Weather: 날씨닷컴 -> http://www.nalsee.com/
/News/Weather: 날씨 ON -> http://www.weather.kr/
/News/Weather: 케이웨더 630 -> http://www.630.co.kr/
/News/Weather: W365닷컴 -> http://www.w365.com/
/News/Weather: 웨더뉴스 날씨 -> http://www.weathernews.co.kr/
/News/Weather: 템포이탈리아 -> http://www.tempoitalia.it/
/News/Weather: 기상청 날씨정보 -> http://www.kma.go.kr/weather/main.jsp
Total: 71
That many for weather alone, hmm..
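The listing still contains repeats: 케이웨더 630 shows up twice under /News/Weather/Today, so the duplicate check mentioned earlier has not been applied yet. One possible way to add it is sketched below. LinkDeduplicator and DistinctByKey are hypothetical names, and the choice of key is an assumption: keying on category plus link removes only repeats inside one category, while keying on the link alone would also collapse a site that is registered under several categories.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original scanner.
public static class LinkDeduplicator
{
    // Keeps the first item for each key and preserves the original order.
    public static IList<T> DistinctByKey<T>(IEnumerable<T> items, Func<T, string> keyOf)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var result = new List<T>();
        foreach (T item in items)
        {
            // HashSet.Add returns false when the key was already present.
            if (seen.Add(keyOf(item)))
                result.Add(item);
        }
        return result;
    }
}
```

With the scanner's list, a call could look like `links = LinkDeduplicator.DistinctByKey(links, l => l.Category + "\n" + l.Hyperlink.Link);` placed right after BuildRecursively returns.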
Now all that's left is CSV output, and the Naver Directory scan is ready to go.
CSV export:
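// Appends one "category,link,name" row per link to <name>.csv ("/" in the name becomes "$").
// Fields are written as-is, so a site name containing a comma will break the column layout.
private void ExportCsv(string name, IList<CategorizedHyperlink> links)
{
    // The using block closes the file even if a write fails.
    using (StreamWriter writer = new StreamWriter(name.Replace("/", "$") + ".csv", true, Encoding.Default))
    {
        foreach (CategorizedHyperlink link in links)
        {
            writer.WriteLine(link.Category + "," + link.Hyperlink.Link + "," + link.Hyperlink.Name);
        }
    }
}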
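Site names are free text, so one that contains a comma or a double quote would shift the columns of the exported file. A minimal sketch of field quoting in the usual CSV style (wrap the field in double quotes and double any embedded quote) could look like this. Csv, Escape and Line are hypothetical names introduced for illustration only.

```csharp
using System;

// Hypothetical helper, not part of the original exporter.
public static class Csv
{
    // Quotes a field only when it contains a comma, a double quote, or a line break.
    public static string Escape(string field)
    {
        if (field == null)
            return "";
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!needsQuotes)
            return field;
        // Embedded double quotes are doubled inside a quoted field.
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Joins already-unescaped fields into one CSV row.
    public static string Line(params string[] fields)
    {
        return string.Join(",", Array.ConvertAll(fields, f => Escape(f)));
    }
}
```

In the exporter, the row could then be written as `writer.WriteLine(Csv.Line(link.Category, link.Hyperlink.Link, link.Hyperlink.Name));`.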
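C# really is the best.. I think the whole thing took less than an hour.
News URL collection file: News.csv
Looking back, I didn't take paging into account, but I'll think about that later if it turns out to be needed.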




