An increasing amount of scientific data, information and literature is available on the internet in the form of scientific databases. Unlike their industrial counterparts, on-line scientific databases are typically exposed either as repositories of XML data or in the form of remote procedure calls (RPCs) that can be invoked over the internet as web services.
Of the scientific disciplines, the life sciences are currently by far the most advanced in terms of on-line databases. This is primarily due to the explosion in bioinformatics-related data, most notably DNA and protein sequences, that may now be interrogated over the internet.
The .NET framework was specifically designed to cater for the requirements of on-line databases, XML data and web services. Consequently, on-line scientific databases can be accessed quickly and efficiently from F# programs by leveraging the .NET framework. This chapter describes the use of existing .NET functionality to interrogate two of the most important scientific databases in the life sciences: the Protein Data Bank (PDB) and GenBank.
The PDB provides a variety of tools and resources for studying the structures of biological macromolecules and their relationships to sequence, function, and disease. The PDB is maintained by the Research Collaboratory for Structural Bioinformatics (RCSB), a non-profit consortium dedicated to improving the understanding of biological systems function through the study of the 3-D structure of biological macromolecules.
The PDB provides a simple web interface that allows compressed XML files containing the data about a given protein to be downloaded over the web. This section describes how the XML data for a given protein may be downloaded and uncompressed using F# functions that are designed to be as generic and reusable as possible. Each protein in the PDB has a four-character identifier. The following function converts such an identifier into the explicit URL of the compressed XML data:
> let pdb name =
    "ftp://ftp.rcsb.org" + "/pub/pdb/data/structures/divided/XML/" +
    String.sub name 1 2 + "/" + name + ".xml.gz";;
val pdb : string -> string
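For readers on a current F# toolchain, the same URL construction can be sketched with the .NET `Substring` method. The name `pdbUrl`, the length check and the lowercasing are illustrative additions, made on the assumption that the archive stores identifiers in lower case:

```fsharp
// Sketch: build the download URL for a four-character PDB identifier.
// pdbUrl, the length check and the lowercasing are illustrative assumptions.
let pdbUrl (name: string) =
    if name.Length <> 4 then
        invalidArg "name" "a PDB identifier has exactly four characters"
    let code = name.ToLowerInvariant()
    // The archive is divided into directories named after the middle two characters.
    sprintf "ftp://ftp.rcsb.org/pub/pdb/data/structures/divided/XML/%s/%s.xml.gz"
        (code.Substring(1, 2)) code
```

The middle two characters select the subdirectory, which is what `String.sub name 1 2` computes in the definition above.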
The XML data are compressed in the GZip format, so the XML data for a given protein may be downloaded and uncompressed by combining the pdb, gunzip (see section 9.8) and xml_of_stream (see section 9.9) functions into a single download function:
> let download protein =
    let url = pdb protein
    let request = System.Net.WebRequest.Create(url)
    let response = request.GetResponse()
    use stream = response.GetResponseStream()
    gunzip stream |> xml_of_stream;;
val download : string -> System.Xml.XmlDocument
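The gunzip and xml_of_stream functions are defined in earlier sections and are not repeated here. As a rough sketch of what such helpers might look like, the following uses the standard `GZipStream` and `XmlDocument.Load`. The definitions and the `gzipBytes` helper are assumptions made for illustration, and the data is compressed in memory so that no network access is needed:

```fsharp
open System.IO
open System.IO.Compression
open System.Text
open System.Xml

// Hypothetical stand-ins for the gunzip and xml_of_stream functions
// of sections 9.8 and 9.9; the originals may differ.
let gunzip (stream: Stream) : Stream =
    new GZipStream(stream, CompressionMode.Decompress) :> Stream

let xmlOfStream (stream: Stream) =
    let doc = XmlDocument()
    doc.Load stream
    doc

// Illustrative helper: GZip-compress a string in memory.
let gzipBytes (text: string) =
    use buffer = new MemoryStream()
    let gz = new GZipStream(buffer, CompressionMode.Compress, true)
    let bytes = Encoding.UTF8.GetBytes text
    gz.Write(bytes, 0, bytes.Length)
    gz.Dispose() // flushes the remaining compressed data into buffer
    buffer.ToArray()

// Round trip: compress a tiny document, then decompress and parse it.
let doc =
    use compressed = new MemoryStream(gzipBytes "<protein id=\"demo\"><acid>HIS</acid></protein>")
    xmlOfStream (gunzip (compressed :> Stream))
```

The same two helpers would be applied to the response stream in place of the in-memory stream when downloading real data.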
This download function can be used to obtain the XML data for an individual protein:
> let doc = download "1hxn";;
val doc : System.Xml.XmlDocument
The resulting value is stored in the object-oriented representation of XML data used by the .NET platform, referred to as the System.Xml.XmlDocument type. This native .NET data structure allows XML data to be constructed and dissected using object-oriented idioms.
The default printing of XML data in F# interactive sessions as nested sequences is of little use:
> doc;;
val it : System.Xml.XmlDocument = seq [seq []; seq [seq ...
Consequently, a pretty printer for XML data is very useful when dissecting XML data interactively using F#.
The XmlDocument type represents a whole XML document and includes global metainformation about the data itself and the method of encoding. XML data is essentially just a tree and, ideally, we would like to be able to visualize this tree in interactive sessions and dissect it using pattern matching.
Nodes in an XML tree are derived from the base type XmlNode. The two most important kinds of node are Element and Text nodes. In XML, an Element node corresponds to a tag such as <br /> and a Text node corresponds to verbatim text such as "Title" in <h1>Title</h1>.
As described in section 2.6.4.3, pretty printers can be installed in a running interactive session to improve the visualization of values of a certain type. In this case, we require a pretty printer for XML documents and nodes:
> fsi.AddPrinter(fun (xml : System.Xml.XmlNode) -> xml.OuterXml);;
Viewing an XML tree in the interactive session now displays the string representation of the XML:
> doc;;
val it : System.Xml.XmlDocument = <?xml version="1.0" encoding="UTF-8"?><PDBx:...
This pretty printer makes it much easier to dissect XML data interactively using F#.
XML data could be examined elegantly using pattern matching by translating a .NET XML tree of objects into an F# variant type (as discussed in section 9.9) and then dissecting the tree using pattern matching. However, the creation of an intermediate tree is wasteful.
A much more efficient way to dissect XML data is to use one of the unique features of F# called active patterns or views. Active patterns allow a non-F# data structure, such as the object-oriented .NET representation of an XML node, to be viewed as if it were a native F# variant type. This is achieved by defining active pattern constructors (in this case called Element and Text) that present the .NET data structure in a form suitable for pattern matching.
Before defining the active pattern, we define an element function that extracts the tag name, attributes and child nodes from an Element node:
> let element (elt : System.Xml.XmlElement) =
    elt.LocalName,
    seq { for attrib in elt.Attributes ->
            attrib.Name, attrib.Value },
    seq { for child in elt.ChildNodes ->
            child };;
val element :
  System.Xml.XmlElement ->
    string * seq<string * string> * seq<System.Xml.XmlNode>
The following defines two active patterns called Element and Text that match the respective kinds of XML node using the :? construct to detect the run-time type of an object:
> let (|Element|Text|) (n : System.Xml.XmlNode) =
    match n with
    | :? System.Xml.XmlElement as n -> Element(element n)
    | n -> Text n.InnerText;;
val (|Element|Text|) :
  System.Xml.XmlNode ->
    Choice<(string * seq<string * string> * seq<System.Xml.XmlNode>),string>
The following acids_of_xml function uses the active pattern in sequence comprehensions to extract the amino-acid sequence from the given XML document for a protein:
> let acids_of_xml (doc : System.Xml.XmlDocument) =
    let cs = doc.DocumentElement.ChildNodes
    seq { for Element("entity_poly_seqCategory", _, cs) in cs
          for Element("entity_poly_seq", attribs, _) in cs
          for "mon_id", acid in attribs ->
            acid };;
val acids_of_xml : System.Xml.XmlDocument -> seq<string>
This is a beautiful example of the power and flexibility of active patterns and the comprehension syntax. The lines of this function represent progressively more fine-grained dissections of the XML document. The entity_poly_seqCategory tag is extracted first, followed by the nested entity_poly_seq tag and, finally, the attribute called mon_id that contains the amino acid sequence. By wrapping the whole function in a single sequence comprehension seq {...}, all of the matching attributes in the nested tag are listed sequentially with minimal effort.
The function can be used in combination with the previous functions to extract the amino-acid sequence of a given PDB protein directly from the on-line database:
> acids_of_xml doc;;
val it : seq<string> = seq ["HIS"; "ARG"; "ASN"; "SER"; ...]
This makes F# an extraordinarily powerful tool for the interactive analysis of PDB data. However, this example would not be complete without describing how easily the data structures involved can be visualized in a GUI to make data dissection easier than ever before.
The tree-based representation of XML data is ideally suited to graphical visualization using the ubiquitous TreeView Windows Forms control. This is the same control used to provide the default tree representation of disc storage on the left hand side of Windows Explorer.
Windows Forms programming makes heavy use of the following namespace:
> open System.Windows.Forms;;
We begin by creating a new Windows form:
> let form = new Form(Visible=true, Text="Protein data");;
val form : Form
> form.TopMost <- true;;
val it : unit = ()
and adding an empty TreeView control to it, making sure that the control expands to fill the whole window:
> let tree = new TreeView(Dock=DockStyle.Fill);;
val tree : TreeView
> form.Controls.Add(tree);;
val it : unit = ()
The following function traverses the tree representation of an XML data structure using the active patterns Element and Text defined above, accumulating a TreeNode ready for insertion into the TreeView:
> let rec treeview_of_xml = function
    | Text string -> new TreeNode(string)
    | Element(tag, attribs, cs) ->
        let parent = new TreeNode(tag)
        for n in cs do
          parent.Nodes.Add(treeview_of_xml n) |> ignore
        parent;;
val treeview_of_xml : XmlNode -> TreeNode
> (root :> System.Xml.XmlNode) |> treeview_of_xml |> tree.Nodes.Add;;
val it : unit = ()
Figure 10.1. Using a Windows Forms TreeView control to visualize the contents of an XML document from the Protein Data Bank.
The result is shown in figure 10.1. Even a minimal GUI such as this can be instrumental in interactive data dissection, greatly accelerating the development process.
The term "web services" refers to any service provided on-line by a remote computer. Typically, a web service is a programmatic way to interrogate the contents of a database held on the remote server. In some cases, web services are used to combine and process information from various sources.
The Simple Object Access Protocol (SOAP) is by far the most popular way to access web services. A SOAP API consists of a variety of dynamically-typed functions known as remote procedure calls (RPCs) that can be used to request information from a web service. SOAP APIs are typically encapsulated in a single definition using a format known as the Web Services Description Language (WSDL). A WSDL description of a SOAP API may be automatically compiled into statically-typed function definitions, making it easier to avoid type errors when using web services.
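To make the idea of an RPC over SOAP more concrete, the following sketch builds the kind of XML envelope that a SOAP 1.1 client sends when it invokes a remote function. The envelope namespace is the standard SOAP 1.1 one. The function name `soapRequest`, the method namespace `urn:example-temperature` and the argument name `zipcode` are hypothetical and chosen only for illustration. Generated client code normally builds and sends such envelopes automatically.

```fsharp
open System.Xml

// Standard SOAP 1.1 envelope namespace.
let soapNs = "http://schemas.xmlsoap.org/soap/envelope/"

// Build an envelope that calls one remote function with named string arguments.
// The method namespace and argument names are supplied by the caller.
let soapRequest (methodNs: string) (methodName: string) (args: (string * string) list) =
    let doc = XmlDocument()
    let envelope = doc.CreateElement("soap", "Envelope", soapNs)
    let body = doc.CreateElement("soap", "Body", soapNs)
    let call = doc.CreateElement("m", methodName, methodNs)
    for name, value in args do
        let arg = doc.CreateElement name
        arg.InnerText <- value
        call.AppendChild arg |> ignore
    body.AppendChild call |> ignore
    envelope.AppendChild body |> ignore
    doc.AppendChild envelope |> ignore
    doc

// Hypothetical call: the namespace and argument name are assumptions.
let request = soapRequest "urn:example-temperature" "getTemp" [ "zipcode", "90210" ]
```

Because every argument travels as text inside the envelope, the protocol itself is dynamically typed. The static types come only from the code generated from the WSDL description.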
The use of web services in F# revolves around the use of web references in C#, creating a C# DLL and linking to it from an F# program. The process of creating C# DLLs and referencing them from F# code was described in section 2.6.5.
To add a web reference to a C# DLL project, right click on "References" in Solution Explorer and select "Add Web Reference...". Paste the URL of the WSDL file into this window and click "Go". Once the WSDL file has been downloaded and examined, Visual Studio gives an overview of the API described by the file. Adding a WSDL web reference to a project causes Visual Studio to autogenerate thousands of lines of C# source code implementing the whole of the API provided by that web service (the description of which was in the WSDL file). By compiling the autogenerated C# code into a .NET DLL and linking to it from F#, web services can be used from F#.
As a very simple initial example, create a Visual Studio solution composed of a C# DLL and an F# program referencing the DLL (following the description in section 2.6.5) and add a web reference to the URL:
http://www.xmethods.com/sd/TemperatureService.wsdl
Now build the C# DLL and use the #r directive in the F# code to load the C# DLL and then open its namespace:
> #r "ClassLibrary1.dll";;
> open ClassLibrary1;;
This web service simply provides a getTemp function that takes a US zipcode as a string and returns the current temperature in that region, in Fahrenheit, as a float.
A new instance of the TemperatureService class must be created in order to use this web service:
> let server = new net.xmethods.www.TemperatureService();;
val server : TemperatureService
This class provides the getTemp function as a member. Invoking this member function causes the remote procedure call to be made to the SOAP web service hosted at xmethods.com and should return the temperature in Beverly Hills in the following case:
> server.getTemp("90210");;
val it : float32 = 52.0f
This is a minimal example demonstrating the creation of a C# DLL, the use of web references and the interoperability between F# and C# to use some of the C# tools from F# programs.
The National Center for Biotechnology Information (NCBI) was established in 1988 as a national resource for molecular biology information. The NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information.
The NCBI web service enables developers to access Entrez Utilities using SOAP. Programmers may write software applications that access the E-Utilities in F#. The WSDL file describing the E-Utilities SOAP interface is available at the following URL:
http://www.ncbi.nlm.nih.gov/entrez/eutils/soap/eutils.wsdl
This web reference can be made accessible from F# by following the same procedure as before and creating an instance of the eUtilsService class:
> let serv = new gov.nih.nlm.ncbi.eutils.eUtilsService();;
val serv : gov.nih.nlm.ncbi.eutils.eUtilsService
Member functions of this serv object with names of the form run_e*_MS may be used to interrogate the NCBI databases in a variety of ways from the Microsoft platform. The NCBI web service is more comprehensive and correspondingly more complicated than the previous example. These functions essentially ask the database a question, and the object returned is loaded with many different forms of metadata, including the parameters of the call itself, as well as the answer to the question.
The run_eInfo_MS member function of the serv object can be used to obtain general information regarding the web service:
> let res = serv.run_eInfo_MS("", "", "");;
val res : gov.nih.nlm.ncbi.eutils.eInfoResultType
In particular, the DbList property of the result res contains a list of the databases available:
> res.DbList;;
val it : gov.nih.nlm.ncbi.eutils.DbListType =
  CSWebServiceClient.gov.nih.nlm.ncbi.eutils.DbListType
    {items = [|"pubmed"; "protein"; "nucleotide"; "nuccore"; "nucgss";
               "nucest"; "structure"; "genome"; "books"; "cancerchromosomes";
               "cdd"; "gap"; "domains"; "gene"; "genomeprj"; "gensat"; "geo";
               "gds"; "homologene"; "journals"; "mesh"; "ncbisearch";
               "nlmcatalog"; "omia"; "omim"; "pmc"; "popset"; "probe";
               "proteinclusters"; "pcassay"; "pccompound"; "pcsubstance";
               "snp"; "taxonomy"; "toolkit"; "unigene"; "unists"|];}
This list of databases spans a range of different topics within molecular biology, including literature (pubmed), proteins (protein) and DNA sequences (nucleotide).
The web service for the ESearch utility allows users to perform searches on the databases to find database records that satisfy given criteria.
For example, the Protein database can be searched for proteins that have a molecular weight of 200,020:
> let res =
    serv.run_eSearch_MS("protein", "200020[molecular+weight]", "", "", "", "",
                        "", "", "", "", "", "", "0", "15", "", "");;
val res : gov.nih.nlm.ncbi.eutils.eSearchResultType
The result res is an object that encapsulates the IDs of the four proteins found by the search:
> res.IdList;;
val it : string array = [|"16766766"; "4758956"; "16422035"; "4104812"|]
The res object also carries a variety of meta information including the parameters of the search:
> res;;
val it : gov.nih.nlm.ncbi.eutils.eSearchResultType =
  gov.nih.nlm.ncbi.eutils.eSearchResultType
    {Count = "4";
     ERROR = null;
     ErrorList = null;
     IdList = [|"16766766"; "4758956"; "16422035"; "4104812"|];
     QueryKey = null;
     QueryTranslation = "000200020[molecular weight]";
     RetMax = "4";
     RetStart = "0";
     TranslationSet = [||];
     TranslationStack = [|gov.nih.nlm.ncbi.eutils.TermSetType; "GROUP"|];
     WarningList = null;
     WebEnv = null;}
The ESearch utility is the most general purpose way to interrogate the database and, consequently, it is the most useful approach. For example, repeated searches might be used to determine the distribution of the molecular weights of proteins in the database for different kinds of protein.
The web service to the EFetch utility retrieves specific records from the given NCBI database in the requested format according to a list of one or more unique identifiers.
For example, the following fetches the record from the taxonomy database that has the ID 9685:
> let res =
    serv.run_eFetch_MS("taxonomy", "9685", "", "", "", "", "", "", "", "", "",
                       "", "", "");;
val res : gov.nih.nlm.ncbi.eutils.eFetchResultType
Once again, the result res encapsulates information about the database record that was fetched. In this case, the following three properties of the res object describe the result:
> res.TaxaSet.[0].ScientificName;;
val it : string = "Felis catus"
> res.TaxaSet.[0].Division;;
val it : string = "Mammals"
> res.TaxaSet.[0].Rank;;
val it : string = "species"
These properties of the response res show that this record in the taxonomy database refers to cats.
As the EFetch utility fetches specific records, its use is much more limited than the ESearch utility. However, fetching specific database records is much faster than searching the whole database for all records satisfying given criteria.
Thus far, this chapter has described how F# programs can interrogate a variety of third party repositories of information in order to mine them for relevant data. Many of these repositories are actually stored in the form of relational databases exposed via web services. Relational database technology has many advantages such as simplifying and optimizing searches and, in particular, handling concurrent reads and writes by many different users or programs. This technology is ubiquitous in industry and is used to store everything from interactive data from company websites to client databases. Consequently, Microsoft has developed one of the world's most advanced database systems, SQL Server. Moreover, this software is freely available in a largely unrestricted form. So SQL Server can be a valuable tool for scientists wanting to maintain their own repositories of information.
This section describes how instances of SQL Server can be controlled from F# programs using Microsoft's ADO.NET interface. However, exactly the same approach can be used to manipulate and interrogate many other relational database implementations including Firebird.NET, MySql and Oracle. There is a vast body of literature with more detailed information on relational databases [25, 15].
The relational database interfaces for SQL Server are provided in the following two namespaces:
> open System.Data;;
> open System.Data.SqlClient;;
Before a database can be interrogated, a connection to it must be opened.
Databases are connected to using "connection strings". Although these can be built naively using string concatenation, that is a potential security risk because incorrect information might be supplied by the user and injected into the database by the application. So programmatic construction of connection strings is more secure. The SqlConnectionStringBuilder class provides exactly this functionality:
> let connString = new SqlConnectionStringBuilder();;
val connString : SqlConnectionStringBuilder
The following lines indicate that we're accessing a database hosted by SQL Server Express using its integrated security feature (the connection might be rejected if we don't specify this):
> connString.DataSource <- @".\SQLEXPRESS";;
val it : unit = ()
> connString.IntegratedSecurity <- true;;
val it : unit = ()
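The difference between concatenation and programmatic construction can be seen without a database server. The following sketch uses the provider-independent `DbConnectionStringBuilder` class as a stand-in for SqlConnectionStringBuilder. The hostile `catalog` value and the `parse` helper are invented for illustration:

```fsharp
open System.Data.Common

// A user-supplied value that tries to smuggle in an extra setting (illustrative).
let catalog = "chem;Integrated Security=false"

// Naive concatenation: the semicolon splits the value into two settings.
let naive = @"Data Source=.\SQLEXPRESS;Initial Catalog=" + catalog

// Programmatic construction: the builder quotes the value as a whole.
let builder = DbConnectionStringBuilder()
builder.Add("Data Source", @".\SQLEXPRESS")
builder.Add("Initial Catalog", catalog)
let safe = builder.ConnectionString

// Parse a connection string back into its individual settings.
let parse (s: string) =
    let b = DbConnectionStringBuilder()
    b.ConnectionString <- s
    b

// Does each string contain a setting that the program never asked for?
let naiveHasExtra = (parse naive).ContainsKey("Integrated Security")
let safeHasExtra = (parse safe).ContainsKey("Integrated Security")
```

Only the naively concatenated string ends up with an extra setting. The builder keeps the whole user-supplied value inside the single setting it was given for.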
The following lines create a SqlConnection object that will use our connection string when opening its connection to the database:
> let conn = new SqlConnection();;
val conn : SqlConnection
> conn.ConnectionString <- connString.ConnectionString;;
val it : unit = ()
The connection is actually opened by calling the Open member of the SQL connection:
> conn.Open();;
val it : unit = ()
An open database connection can be used to interrogate and manipulate the database.
The Structured Query Language (SQL) is actually a complete programming language used to describe queries sent to databases such that they may be executed quickly and efficiently. Given that SQL is a text-based language, it is tempting to construct queries by concatenating strings and sending the result as a query. However, this is a bad idea because this approach is insecure[20]. Specifically, injecting parameters into a query by concatenating strings will fail if the parameter happens to contain characters that are significant in SQL, such as quotes, and, worse, the parameter might contain SQL code that results in data being deleted or corrupted. The safe alternative is to construct SQL queries programmatically using the SqlCommand class.
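The hazard, and the usual remedy of passing values as command parameters, can be sketched as follows. The helper names `naiveInsert`, `addParameter` and `insertElement` are illustrative. The sketch is written against the provider-independent `IDbConnection` and `IDbCommand` interfaces, which SqlConnection and SqlCommand implement, and it assumes the `Elements` table and the `@name` style of parameter marker used by SQL Server:

```fsharp
open System.Data

// Naive approach: splice the value into the SQL text.
let naiveInsert (name: string) =
    sprintf "INSERT INTO Elements (Name) VALUES ('%s')" name

// A hostile or merely unlucky value changes the meaning of the statement.
let hijacked = naiveInsert "x'); DROP TABLE Elements; --"

// Safer approach: keep the SQL text fixed and pass values as parameters.
let addParameter (comm: IDbCommand) (name: string) (value: obj) =
    let p = comm.CreateParameter()
    p.ParameterName <- name
    p.Value <- value
    comm.Parameters.Add(p) |> ignore

// Insert one row using parameters; returns the number of rows affected.
let insertElement (conn: IDbConnection) (name: string) (number: int) (weight: float) =
    use comm = conn.CreateCommand()
    comm.CommandText <-
        "INSERT INTO Elements (Name, Number, Weight) VALUES (@name, @number, @weight)"
    addParameter comm "@name" (box name)
    addParameter comm "@number" (box number)
    addParameter comm "@weight" (box weight)
    comm.ExecuteNonQuery()
```

The string `hijacked` now contains a second statement that drops the table. With `insertElement` the same text would simply be stored as an unusual element name, because parameter values are never parsed as SQL.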
The following function can be used to execute SQL statements:
> let execNonQuery conn s =
    let comm = new SqlCommand(s, conn, CommandTimeout=10)
    try
      comm.ExecuteNonQuery() |> ignore
    with e ->
      printf "Error: %A\n" e;;
val execNonQuery : SqlConnection -> string -> unit
For example, we can now create our database using the SQL statement CREATE DATABASE:
> execNonQuery conn "CREATE DATABASE chemicalelements";;
val it : unit = ()
Similarly, we can create a database table to hold the data about our chemical elements:
> execNonQuery conn "CREATE TABLE Elements (
    Name varchar(50) NOT NULL,
    Number int NOT NULL,
    Weight float NOT NULL,
    PRIMARY KEY (Number))";;
val it : unit = ()
The SQL type varchar(n) denotes a string with a maximum length of n. Note that we are using the atomic number as the primary key for this table because this value uniquely identifies a chemical element.
The following SQL statements add two rows to our database table, for Hydrogen and Helium:
> execNonQuery conn "INSERT INTO Elements (Name, Number, Weight)
    VALUES ('Hydrogen', 1, 1.008)";;
val it : unit = ()
> execNonQuery conn "INSERT INTO Elements (Name, Number, Weight)
    VALUES ('Helium', 2, 4.003)";;
val it : unit = ()
In addition to manipulating a database using SQL statements, the contents of a database can be interrogated using SQL expressions.
We can query the database to see the current contents of our table using the following function:
> let query() =
    let query = "SELECT Name, Number, Weight FROM Elements"
    seq { let conn_string = connString.ConnectionString
          use conn = new SqlConnection(conn_string)
          do conn.Open()
          use comm = new SqlCommand(query, conn)
          use reader = comm.ExecuteReader()
          while reader.Read() do
            yield (reader.GetString 0, reader.GetInt32 1, reader.GetDouble 2) };;
val query : unit -> seq<string * int * float>
Note the use of an imperative sequence expression to enumerate the rows in the database table and dispose of the connection when enumeration of the sequence is complete. The reader object is used to obtain results for given database columns of the expected type. In this case, columns 0, 1 and 2 contain strings (chemical names), ints (atomic numbers) and double-precision floats (atomic weights), respectively.
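The reader pattern can be exercised without a running SQL Server, because an in-memory `DataTable` can hand out a data reader with the same `Read`, `GetString`, `GetInt32` and `GetDouble` members. The following sketch assumes that substitution. The function names `elementsTable` and `rows` are illustrative:

```fsharp
open System.Data

// Build an in-memory table with the same columns as the Elements table.
let elementsTable () =
    let table = new DataTable("Elements")
    table.Columns.Add("Name", typeof<string>) |> ignore
    table.Columns.Add("Number", typeof<int>) |> ignore
    table.Columns.Add("Weight", typeof<float>) |> ignore
    table.Rows.Add([| box "Hydrogen"; box 1; box 1.008 |]) |> ignore
    table.Rows.Add([| box "Helium"; box 2; box 4.003 |]) |> ignore
    table

// Enumerate the rows lazily; the reader is disposed of when enumeration ends.
let rows (table: DataTable) =
    seq { use reader = table.CreateDataReader()
          while reader.Read() do
              yield (reader.GetString 0, reader.GetInt32 1, reader.GetDouble 2) }

let results = rows (elementsTable ()) |> Seq.toList
```

As in the query function, nothing is read until the sequence is enumerated, so the reader is created and disposed of once per enumeration.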
Executing this query returns the data for Hydrogen and Helium from the database as expected:
> query();;
val it : seq<string * int * float> =
  seq [("Hydrogen", 1, 1.008); ("Helium", 2, 4.003)]
Accessing databases by encoding SQL commands as strings is fine for trivial, rare and interactive operations like creating the database itself but is not suitable for more sophisticated use. To perform a significant amount of computation on the database, such as injecting the data we downloaded from the web service, we need to access the database programmatically.
The SqlDataAdapter acts as a bridge between a DataSet and SQL Server for retrieving and saving data:
> let dataAdapter = new SqlDataAdapter();;
val dataAdapter : SqlDataAdapter
A DataSet is an in-memory cache of data retrieved from a data source. The following function queries our database and fills a new DataSet with the results:
> let buildDataSet conn query =
    dataAdapter.SelectCommand <- new SqlCommand(query, conn)
    let dataSet = new DataSet()
    new SqlCommandBuilder(dataAdapter) |> ignore
    dataAdapter.Fill dataSet |> ignore
    dataSet;;
val buildDataSet : SqlConnection -> string -> DataSet
For example, the following query finds all rows in the database table Elements and returns all three columns in each:
> let dataSet =
    buildDataSet conn "SELECT Name, Number, Weight from Elements";;
val dataSet : DataSet
The following extracts the DataTable of results from this DataSet and iterates over the rows printing the results:
> let table = dataSet.Tables.Item 0;;
val table : DataTable
> for row in table.Rows do
    printf "%A\n" (row.Item "Name", row.Item "Number", row.Item "Weight");;
("Hydrogen", 1, 1.008)
("Helium", 2, 4.003)
val it : unit = ()
Note how the value of the field with the string name field in the row row is obtained using row.Item field.
In addition to programmatically enumerating over the results of a query, the DataSet can be used to inject data into the database programmatically as well. The following creates a new row in the table and populates it with the data for Lithium:
> let row = table.NewRow();;
val row : DataRow
> row.Item "Name" <- "Lithium";;
val it : unit = ()
> row.Item "Number" <- 3;;
val it : unit = ()
> row.Item "Weight" <- 6.941;;
val it : unit = ()
> table.Rows.Add row;;
val it : unit = ()
This change can be uploaded to the database using the Update member of the SqlDataAdapter:
> dataAdapter.Update dataSet;;
val it : int = 1
The return value of 1 indicates that a single row was altered.
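The adapter can upload only the new row because each `DataRow` records its own state. The following in-memory sketch, which needs no database, shows that bookkeeping. The call to `AcceptChanges` stands in for a table that has just been filled from a database, which is an assumption of the sketch:

```fsharp
open System.Data

// An in-memory table with the same columns as the Elements table.
let table = new DataTable("Elements")
table.Columns.Add("Name", typeof<string>) |> ignore
table.Columns.Add("Number", typeof<int>) |> ignore
table.Columns.Add("Weight", typeof<float>) |> ignore

// Pretend this row was loaded from the database and is already in sync.
table.Rows.Add([| box "Hydrogen"; box 1; box 1.008 |]) |> ignore
table.AcceptChanges()

// Add a new row locally, as in the text.
let row = table.NewRow()
row.["Name"] <- box "Lithium"
row.["Number"] <- box 3
row.["Weight"] <- box 6.941
table.Rows.Add(row)

// Only the new row is marked as a pending insertion.
let pending = table.GetChanges(DataRowState.Added)
let pendingCount = pending.Rows.Count
let firstState = table.Rows.[0].RowState
let newState = row.RowState
```

Here `pendingCount` is 1, `firstState` is `Unchanged` and `newState` is `Added`, which is consistent with the adapter reporting a single altered row.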
Querying the database again shows that it does indeed now contain three rows:
> query();;
val it : seq<string * int * float> =
  seq [("Hydrogen", 1, 1.008); ("Helium", 2, 4.003); ("Lithium", 3, 6.941)]
Databases are much more useful when they are filled by a program.
Now we're ready to inject data about all of the chemical elements into the database programmatically. We begin by deleting the three existing rows to avoid conflicts:
> execNonQuery conn "DELETE FROM Elements";;
val it : unit = ()
The following loop adds the data for each element (assuming the existence of a data structure elements that is a sequence of records with the appropriate fields) to the table and then uploads the result:
> for element in elements do
    let row = table.NewRow()
    row.Item "Name" <- element.name
    row.Item "Number" <- element.number
    row.Item "Weight" <- element.weight
    table.Rows.Add row;;
val it : unit = ()
> dataAdapter.Update dataSet |> ignore;;
val it : unit = ()
The database now contains information about the chemical elements from this data structure.
As an industrial-strength platform, .NET naturally makes it as easy as possible to visualize the contents of a database table.
We begin by creating a blank Windows Form:
> open System.Windows.Forms;;
> let form = new Form(Text="Elements", Visible=true);;
val form : Form
As usual, forcing the form to stay on top is useful when developing in an interactive session:
> form.TopMost <- true;;
val it : unit = ()
The DataGrid class provides a Windows Forms control that can be bound to a database table in order to visualize it interactively:
> let grid = new DataGrid(DataSource=table);;
val grid : DataGrid
> grid.Dock <- DockStyle.Fill;;
val it : unit = ()
Note that the grid was bound to our database table and its dock style was set to fill the whole form when it is added:
> form.Controls.Add grid;;
val it : unit = ()
This tiny amount of code produces the interactive GUI application illustrated in figure 10.2.
One of the key advantages of using a database is persistence: the contents of the database will still be there the next time we restart F# or Visual Studio or even the machine itself. However, we inevitably want to delete our rows, tables and databases. This line deletes the Elements table:
> execNonQuery conn "DROP TABLE Elements";;
val it : unit = ()
And this line deletes the database itself:
> execNonQuery conn "DROP DATABASE chemicalelements";;
val it : unit = ()
This chapter has shown how web services can be consumed easily in F# programs by reusing the capabilities provided for C#, and how databases can be created and used from F# programs with minimal effort.
Web applications and databases are the bread and butter of the .NET platform and a great many C# and Visual Basic programs already use this technology. However, F# is the first modern functional programming language to provide professional-quality web and database functionality and, consequently, is opening new avenues for combining these techniques. Functional programming will doubtless play an increasingly important role in web and database programming just as it is changing the way we think about other areas of programming.
[20] Even if the database is private and will not be subjected to malicious attacks it can still be corrupted accidentally.