java - How do I datamine financial tables using HtmlUnit?


Using Java/HtmlUnit, I want to data mine (web scrape) a bunch of hedge fund SEC 13F filings. I have no clue how to data mine the SEC's .txt files, such as this table. The table layout seems clean and structured, so how do I grab the <TABLE> and its corresponding <S> and <C> columns? Moreover, how can I grab the company names, the <C> value (in column 3), and the <C> shares amount (in column 4)?

I'm not sure if I'm on the right track. I used a BufferedReader, but I'm not sure what to do next to grab the data within the <TABLE> ... Here's what I have so far:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.MalformedURLException;
    import java.net.URL;

    public class BufferedReaderExample {

        public static void main(String[] args) {
            try {
                // Create a URL for the desired page
                URL url = new URL("http://www.sec.gov/archives/edgar/data/1047644/000104746912006072/a2209520z13f-hr.txt");
                BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
                String str;
                // Print the filing line by line
                while ((str = in.readLine()) != null) {
                    System.out.println(str);
                }
                in.close();
            } catch (MalformedURLException e) {
                e.printStackTrace(); // report a bad URL instead of swallowing it
            } catch (IOException e) {
                e.printStackTrace(); // report download/read failures
            }
        }
    }

I don't know what kind of format the document has, but HtmlUnit will, at most, allow you to download it from the web. You'll have to do the parsing on your own.

Now, the format doesn't seem to be XML, HTML, or any other standard format (at least as far as the small amount I know goes)... So my first thought was regular expressions, but on second thought I realised that you've got the length of each column represented by the number of dashes (-).

You can use regular expressions to get the text between the <table> tags, then use the programming language to split the dash line into an array of strings, and cut the text of each line below it to the number of characters of each of those strings.
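A minimal sketch of that idea in Java, under some assumptions: each table contains one "ruler" line made only of dashes and spaces, every data line below it follows the same fixed-width layout, and everything above the ruler is header text. The class name `FixedWidthTableSketch`, its helper methods, and the sample rows are all hypothetical and are not taken from a real filing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FixedWidthTableSketch {

    // Everything between <TABLE> and </TABLE>, case-insensitive, spanning lines.
    private static final Pattern TABLE =
            Pattern.compile("<table>(.*?)</table>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // One run of dashes in the ruler line corresponds to one column.
    private static final Pattern DASH_RUN = Pattern.compile("-+");

    /** Returns the data rows of every table in the text, each row as a list of trimmed cells. */
    public static List<List<String>> parse(String text) {
        List<List<String>> rows = new ArrayList<>();
        Matcher table = TABLE.matcher(text);
        while (table.find()) {
            List<int[]> columns = null; // [start, end) offsets read from the ruler line
            for (String line : table.group(1).split("\\R")) {
                if (columns == null) {
                    // Assumption: lines above the ruler are headers and are skipped.
                    if (isRuler(line)) {
                        columns = columnSpans(line);
                    }
                    continue;
                }
                if (!line.trim().isEmpty()) {
                    rows.add(slice(line, columns));
                }
            }
        }
        return rows;
    }

    // A ruler line contains at least one dash and nothing but dashes and whitespace.
    private static boolean isRuler(String line) {
        return line.contains("-") && line.matches("[-\\s]+");
    }

    // Records where each run of dashes starts and ends.
    private static List<int[]> columnSpans(String ruler) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = DASH_RUN.matcher(ruler);
        while (m.find()) {
            spans.add(new int[] {m.start(), m.end()});
        }
        return spans;
    }

    // Cuts one data line at the column offsets; short lines yield empty cells.
    private static List<String> slice(String line, List<int[]> columns) {
        List<String> cells = new ArrayList<>();
        for (int[] span : columns) {
            int start = Math.min(span[0], line.length());
            int end = Math.min(span[1], line.length());
            cells.add(line.substring(start, end).trim());
        }
        return cells;
    }

    public static void main(String[] args) {
        // Hypothetical sample data laid out to match the ruler widths (16, 6, 7, 8).
        String sample = String.join("\n",
                "<TABLE>",
                "<CAPTION>",
                String.format("%-16s %-6s %7s %8s", "NAME OF ISSUER", "CLASS", "VALUE", "SHARES"),
                String.format("%-16s %-6s %7s %8s", "<S>", "<C>", "<C>", "<C>"),
                "---------------- ------ ------- --------",
                String.format("%-16s %-6s %7s %8s", "ACME CORP", "COM", "1234", "56789"),
                String.format("%-16s %-6s %7s %8s", "EXAMPLE INC", "COM", "42", "1000"),
                "</TABLE>");
        for (List<String> row : parse(sample)) {
            // Column 1 = name, column 3 = value, column 4 = shares (zero-based 0, 2, 3).
            System.out.println(row.get(0) + " | " + row.get(2) + " | " + row.get(3));
        }
    }
}
```

To use it on a downloaded filing, the lines read by the BufferedReader could be joined with "\n" and passed to `parse`. If some values turn out to be wider than their run of dashes, one option is to let each column extend to the start of the next one instead of stopping at the end of its dashes.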

That'd be it :)

