List Files on an HTTP/FTP Server in R
I'm trying to get a list of files on an HTTP/FTP server from R, so that in a next step I can download them (or select only the files that meet my criteria to download).
Solution 1:
You really shouldn't use regex on HTML. The XML package makes this pretty simple: we can use getHTMLLinks() to gather any links we want. (Here result is assumed to already hold the HTML of the directory-listing page, e.g. as returned by RCurl::getURL() in Solution 2.)
library(XML)
getHTMLLinks(result)
# [1] "Interesting file_20150629.txt""Interesting file_20150630.txt"# [3] "Interesting file_20150701.txt""Interesting file_20150702.txt"# [5] "Interesting file_20150703.txt""Interesting file_20150704.txt"# [7] "Interesting file_20150705.txt""Interesting file_20150706.txt"# [9] "Interesting file_20150707.txt""Interesting file_20150708.txt"# [11] "Interesting file_20150709.txt"
That will get all /@href links contained in //a nodes. To grab only the ones that contain .txt, you can use a different XPath query from the default:
getHTMLLinks(result, xpQuery = "//a/@href[contains(., '.txt')]")
Or, even more precisely, to get only those files that end with .txt, you can do:
getHTMLLinks(
result,
xpQuery = "//a/@href['.txt'=substring(., string-length(.) - 3)]"
)
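To finish the original task of actually downloading the matching files, the pieces can be combined as below. This is only a sketch: http://server/ is a placeholder for the real directory URL, and result is fetched with RCurl::getURL() (see Solution 2).
library(RCurl)
library(XML)
base_url <- "http://server/"   # placeholder; use the real directory URL here
result <- getURL(base_url)     # HTML of the directory-listing page
files <- getHTMLLinks(result, xpQuery = "//a/@href['.txt'=substring(., string-length(.) - 3)]")
# download each .txt file into the working directory;
# URLencode() takes care of any spaces in the file names
for (f in files) {
  download.file(paste0(base_url, URLencode(f)), destfile = basename(f), mode = "wb")
}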
Solution 2:
An alternative, without loading additional libraries beyond RCurl, is to set ftp.use.epsv = FALSE and crlf = TRUE. The crlf = TRUE option instructs libcurl to change \n's to \r\n's:
require("RCurl")
result <- getURL("http://server", verbose = TRUE, ftp.use.epsv = FALSE, dirlistonly = TRUE, crlf = TRUE)
Then extract the individual file URLs using paste and strsplit:
result2 <- paste("http://server", strsplit(result, "\r*\n")[[1]], sep = "")
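If the next step is to download them, a minimal sketch that loops over those URLs (assuming every entry in result2 is a plain file URL; URLencode() guards against spaces in the names):
for (f in result2) {
  download.file(URLencode(f), destfile = basename(f), mode = "wb")
}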