data-libs

1   Overview

This is a repository of libraries automatically generated by the Patlac::Xml2cpp software. Each library is a C++ translation of its respective XSD schema and, as such, is bound by that schema's license. The sources include a class for every type described in the schema, a serialization function, and a SAX parser that can be used in iterative mode for constant memory usage.

2   Installation

2.1   Download

The libraries forming the data-libs collection are all available as source tar.gz archives at http://sourceforge.net/projects/data-libs. You have to compile them yourself:

]$ tar xzvf libdata_uniprot-1.0.1.tar.gz
]$ cd libdata_uniprot-1.0.1
]$ ./configure
]$ make
]$ sudo make install

If you are not a sudoer, you need to tell the configure script where to install the package:

]$ ./configure --prefix=/home/username --exec-prefix=/home/username
]$ make
]$ make install

Don't forget to add the directory where the library was installed (by default the lib subdirectory of the prefix) to your LD_LIBRARY_PATH environment variable if you want to use the library.
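
A minimal sketch of that step, assuming the /home/username prefix from the example above and the default autotools layout, in which libraries land in the lib subdirectory of the prefix. Adjust the path if you passed a different prefix or a --libdir option.

```shell
# Assumption: same prefix as the ./configure example; libraries go to $PREFIX/lib by default.
PREFIX=/home/username
# Prepend the directory and keep any value that is already set.
export LD_LIBRARY_PATH="$PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
# Show the resulting search path.
echo "$LD_LIBRARY_PATH"
```

Putting the export line in your shell startup file makes the setting permanent.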

2.2   Dependencies

These libraries are all delivered with a SAX parser implemented on top of the SaxMagique <http://saxmagique.sourceforge.net/> library. SaxMagique is a C++ library that needs the STL and Boost to compile, and libexpat and libz have to be linked. For libexpat, you need a development version >= 1.95.8 in order to compile successfully. For Boost, a development version >= 1.33.1 is required. If these libraries are not in a standard directory, use --with-boost=dir or --with-expat=dir. For example, if you built these libraries from source and installed them in /usr/local:

]$ ./configure --with-boost=/usr/local --with-expat=/usr/local

To link against one of these libraries, add these flags to your LDFLAGS variable: "-ldata_xxx -lpatlac_common -lexpat -lsaxmagique". This can be done via m4 macros and files. See the proteinsconv project for examples of how to do this.
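
For a quick manual build without the m4 macros, the flags can be set by hand. The sketch below makes several assumptions: the package was installed under /home/username, headers were installed in the include subdirectory of that prefix, libdata_uniprot is linked as -ldata_uniprot, and my_tool.cpp is a hypothetical source file of your own. It prints the compile command rather than running it.

```shell
# Assumption: installation prefix from section 2.1.
PREFIX=/home/username
# -L adds the library search path; xxx is replaced by uniprot as an example.
LDFLAGS="-L$PREFIX/lib -ldata_uniprot -lpatlac_common -lexpat -lsaxmagique"
# Assumption: headers were installed under $PREFIX/include.
CPPFLAGS="-I$PREFIX/include"
# my_tool.cpp is hypothetical; the command is only printed, not executed.
echo g++ $CPPFLAGS my_tool.cpp -o my_tool $LDFLAGS
```

The m4 approach used by proteinsconv remains the better choice for a distributed project, since it checks for the libraries at configure time.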

All these dependencies are implementation details. If you are really interested in one of these libraries but are put off by these details, do not hesitate to send me a request. Note, however, that the SAX parser would be much harder to build without SaxMagique.

3   Supported schemas

3.1   XMLSchema

This schema is huge and has some peculiarities. First, it has a restriction of <xs:any>, which I'm not sure how to interpret. It also has an <xs:sequence> of <xs:choice>, which I simplify to a plain sequence. As a side effect, annotations are all grouped together instead of staying where they belong.

The libdata_xs library is used by the Patlac::Xml2cpp software, which itself generated libdata_xs. This is strong proof of usability, but also a chicken-and-egg paradox.

3.2   uniprot and uniref

These two fairly big schemas are used by the illustrative software proteinsconv.

3.3   pepXML and protXML

These are two XSD schemas that make use of <xs:any/>. This is not yet implemented. However, the rest of each schema works well. In particular, pepXML files generated by the trans_proteomic_pipeline (Tandem2XML) by conversion from bioml are fully supported.

3.4   Part of bioml and gaml

I have written three XSD schemas that represent the input and output files of X!Tandem. These schemas must be distinguished from the bioml.dtd and gaml110.xsd that can be found on the internet. Even though they are related, they differ and are not compatible.

3.5   mzXML

The mzXML schema by itself is not enough to define mzXML files. The SHA-1 sum and the offsets that are required are not part of the XSD specification. As they have no meaning outside of a stream, they are computed by the serialization function write_xml. The way this is done is a little tricky but totally transparent to the end user of the library.

The computed offsets may be tested with any software that makes use of them, for instance msInspect.

The mzXML schema also has a particularity which is not yet handled here. The dynamic inheritance specified in the mzXML XSD schema for separationTechniqueType through the abstract="true" attribute has no counterpart in this library.
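
As a rough cross-check from the shell, the byte position of each scan element can be listed and compared by hand with the offsets stored in the file's index. This is only a sketch: it assumes GNU grep, assumes each offset points at the "<" that opens a scan element, and uses a hypothetical miniature file that is not a valid mzXML document.

```shell
# Hypothetical miniature file, only meant to show what a byte offset is.
sample=$(mktemp)
printf '<mzXML>\n<scan num="1"/>\n<scan num="2"/>\n</mzXML>\n' > "$sample"
# -b prints the byte offset; with -o it is the offset of the match itself (GNU grep).
offsets=$(grep -bo '<scan ' "$sample" | cut -d: -f1)
# One offset per line, in file order.
echo "$offsets"
rm -f "$sample"
```

On a real file, replace the temporary file with the path of the mzXML file written by the library.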

4   proteinsconv

proteinsconv is an application that reads either uniprot or uniref formatted XML files and converts them to a fasta file, optionally outputting some variant sequences. It may also export a bioml file describing Single Amino Acid Polymorphisms (SAPS) suited for X!Tandem. The proteinsconv application makes use of the classes generated from three different schemas (uniref.xsd, uniprot.xsd and bioml_saps.xsd).

Please consider the proteinsconv software as an illustration of how to use these libraries. Most basic questions can be answered by looking into its sources. In particular, you may be interested in:

  1. How to use m4 macros and files to test for the presence of libraries and to adjust CPPFLAGS and LDFLAGS accordingly. This is explained in the configure.ac and Makefile.am files in the src directory. You will need the libdata_XXX.m4 files located in the m4 directory of each libdata library.
  2. How to iterate through an XML file using a constant amount of memory. This is done in the src/main.cpp file.
  3. How to write results to an XML file, still using a constant amount of memory. This is done in the src/bioml_saps_creator.cpp file.
  4. How to get and set data in instances of the library classes. This is done in both the src/fastarizer.hpp and src/bioml_saps_creator.cpp files. Notice the use of boost::bind to easily create predicates suitable for STL algorithms.
  5. How to activate the gettext multilingual functionality. This is done in the src/main.cpp file.

5   License

All the software and libraries distributed here are GPL licensed, except where this conflicts with the license of their respective XSD schema (I'm not a lawyer). This means that they are totally free of charge and may be used, modified and distributed as long as the source code is made available, but they come without any kind of warranty, including for usability. In particular, if you distribute a modified version of these contents, you have to make the source code of your modifications publicly available, and if you distribute software or libraries that use these contents, you have to make the source code of that software or those libraries publicly available.

6   Author

If you use one of these libraries, I'd like to know about it; please contact me.

Copyright 2007 Patrick Lacasse