apybiomart¶
Async pythonic interface to BioMart.
Free software: MIT license
Documentation: https://apybiomart.readthedocs.io
GitHub repo: https://github.com/robertopreste/apybiomart
Features¶
apybiomart is a Python module that provides a simple asynchronous interface to Ensembl BioMart. Users can exploit the async interface to schedule multiple queries using all the facilities offered by Python’s asyncio library.
Depending on specific needs, apybiomart offers different entry points:
an asynchronous aquery() function, to schedule multiple queries in the same event loop;
a synchronous query() function, which can be used for exploratory queries, executed in real time;
a set of synchronous find_*() functions, which can be used to retrieve the list of available marts (find_marts()), datasets for a specific mart (find_datasets()), attributes (find_attributes()) and filters (find_filters()) for a specific dataset.
A set of related CLI commands also exists to allow exploration of these data from the command line; these are, respectively, apybiomart marts, apybiomart datasets, apybiomart attributes and apybiomart filters. Run apybiomart --help for further details.
Please refer to the Usage section of the documentation for further information.
Background¶
apybiomart was originally born as a fork of the great pybiomart package.
I was working on a project that employed a series of async calls to several online resources, but I couldn’t manage to perform asynchronous calls to BioMart using that package, so I decided to modify it to better suit my needs. However, it gradually evolved into a very different thing: the original implementation was rewritten and the structure of the package changed a bit, in a way that I found most useful for my purpose.
That said, all credit goes to jrderuiter, who created the original pybiomart package.
Installation¶
apybiomart only supports Python 3, and can be installed using pip:
pip install apybiomart
Please refer to the Installation section of the documentation for further information.
Credits¶
This package was created with Cookiecutter and the cc-pypackage project template.
Installation¶
PLEASE NOTE: apybiomart only supports Python 3!
Stable release¶
To install apybiomart, run this command in your terminal:
$ pip install apybiomart
This is the preferred method to install apybiomart, as it will always install the most recent stable release.
If you don’t have pip installed, this Python installation guide can guide you through the process.
From sources¶
The sources for apybiomart can be downloaded from the GitHub repo.
You can either clone the public repository:
$ git clone https://github.com/robertopreste/apybiomart
Or download the tarball:
$ curl -OL https://github.com/robertopreste/apybiomart/tarball/master
Once you have a copy of the source, you can install it with:
$ python setup.py install
Usage¶
apybiomart can be used in a project with a simple import:
import apybiomart
The main purpose of the package is to perform queries on BioMart (either synchronously or asynchronously); however, users may first need to explore the available marts, datasets, attributes and filters.
In addition to inspecting these results interactively, users can save them to a CSV file using the --save flag on the CLI or the save=True argument in Python, and can optionally specify a filename using the --output <filename.csv> option on the CLI or the output="filename.csv" argument in Python.
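As a minimal sketch of the Python side, the snippet below wraps find_marts() with the save and output arguments described above. The helper name save_marts_to_csv and the file name marts.csv are illustrative assumptions, not part of apybiomart; actually calling the helper requires apybiomart to be installed and a working internet connection.

```python
def save_marts_to_csv(filename="marts.csv"):
    """Hypothetical helper: fetch the available marts and also save them to `filename`."""
    # Imported lazily, so defining the helper does not need apybiomart or the network.
    from apybiomart import find_marts

    # save=True asks for the dataframe to be written to a CSV file;
    # output sets the name of that file.
    return find_marts(save=True, output=filename)
```

Calling save_marts_to_csv("my_marts.csv") should return the usual dataframe and, given the save option, also write it to my_marts.csv.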
Marts, datasets, attributes and filters¶
BioMart contains different databases, called marts, each of which in turn contains several datasets, each related to a specific species. These datasets can be queried and it is possible to restrict the amount of data returned to one or more particular types of information, namely attributes, and using filters that only retain data satisfying one or more specific criteria.
For more information, please refer to BioMart’s help page.
Marts¶
In order to view the marts available on BioMart, the find_marts() function can be used:
from apybiomart import find_marts
find_marts()
A dataframe with the available marts is returned, with their name and display_name (the Mart_ID and Mart_name columns, respectively):
   Mart_ID                Mart_name
0  ENSEMBL_MART_ENSEMBL   Ensembl Genes 96
1  ENSEMBL_MART_MOUSE     Mouse strains 96
2  ENSEMBL_MART_SEQUENCE  Sequence
3  ENSEMBL_MART_ONTOLOGY  Ontology
4  ENSEMBL_MART_GENOMIC   Genomic features 96
5  ENSEMBL_MART_SNP       Ensembl Variation 96
6  ENSEMBL_MART_FUNCGEN   Ensembl Regulation 96
A CLI command is also available to retrieve the same information: apybiomart marts.
Datasets¶
Available datasets for a specific mart can be retrieved using the find_datasets() function:
from apybiomart import find_datasets
find_datasets(mart="ENSEMBL_MART_ENSEMBL")
# same as above, using the default mart
find_datasets()
The find_datasets() function accepts an optional mart argument, which defaults to “ENSEMBL_MART_ENSEMBL”. The returned dataframe contains all the available datasets in the given mart, with their name, display_name and the mart to which they belong:
     Dataset_ID                  Dataset_name                              Mart_ID
0    rroxellana_gene_ensembl     Golden snub-nosed monkey genes (Rrox_v1)  ENSEMBL_MART_ENSEMBL
1    ggallus_gene_ensembl        Chicken genes (GRCg6a)                    ENSEMBL_MART_ENSEMBL
2    dmelanogaster_gene_ensembl  Drosophila melanogaster genes (BDGP6.22)  ENSEMBL_MART_ENSEMBL
..   ...                         ...                                       ...
181  sdorsalis_gene_ensembl      Yellowtail amberjack genes (Sedor1)       ENSEMBL_MART_ENSEMBL
182  ohni_gene_ensembl           Japanese medaka HNI genes (ASM223471v1)   ENSEMBL_MART_ENSEMBL
183  pmarinus_gene_ensembl       Lamprey genes (Pmarinus_7.0)              ENSEMBL_MART_ENSEMBL
A CLI command is also available to retrieve the same information: apybiomart datasets, whose --mart option can be used to specify which mart will be used (default is “ENSEMBL_MART_ENSEMBL”).
Attributes¶
When querying a dataset, users may want to retrieve specific attributes; the find_attributes() function accepts an optional dataset (defaulting to “hsapiens_gene_ensembl”) and gathers all the available attributes for the given dataset:
from apybiomart import find_attributes
find_attributes(dataset="hsapiens_gene_ensembl")
# same as above, using the default dataset
find_attributes()
The dataframe returned contains each attribute’s name, display_name, description (where available), and the dataset to which it belongs:
      Attribute_ID             Attribute_name          Attribute_description             Dataset_ID
0     ensembl_gene_id          Gene stable ID          Stable ID of the Gene             hsapiens_gene_ensembl
1     ensembl_gene_id_version  Gene stable ID version  Versionned stable ID of the Gene  hsapiens_gene_ensembl
2     ensembl_transcript_id    Transcript stable ID    Stable ID of the Transcript       hsapiens_gene_ensembl
..    ...                      ...                     ...                               ...
3348  cds_length               CDS Length                                                hsapiens_gene_ensembl
3349  cds_start                CDS start                                                 hsapiens_gene_ensembl
3350  cds_end                  CDS end                                                   hsapiens_gene_ensembl
A CLI command is also available to retrieve the same information: apybiomart attributes, whose --dataset option can be used to specify which dataset will be used (default is “hsapiens_gene_ensembl”).
Filters¶
Datasets can be queried using filters that restrict the returned information to some specific
subset of interest (e.g. chromosome, start position, etc.). In order to retrieve the list of
filters available for a given dataset, the find_filters() function can be used:
from apybiomart import find_filters
find_filters("hsapiens_gene_ensembl")
# same as above, using the default dataset
find_filters()
This function accepts an optional dataset argument, which defaults to “hsapiens_gene_ensembl”, and returns a dataframe with the name, type and description (where available) of each filter, as well as the dataset to which it belongs:
    Filter_ID                          Filter_type  Filter_description  Dataset_ID
0   link_so_mini_closure               list                             hsapiens_gene_ensembl
1   link_go_closure                    text                             hsapiens_gene_ensembl
2   link_ensembl_transcript_stable_id  text                             hsapiens_gene_ensembl
..  ...                                ...          ...                 ...
39  germ_line_variation_source         list                             hsapiens_gene_ensembl
40  somatic_variation_source           list                             hsapiens_gene_ensembl
42  so_consequence_name                list                             hsapiens_gene_ensembl
A CLI command is also available to retrieve the same information: apybiomart filters, whose --dataset option can be used to specify which dataset will be used (default is “hsapiens_gene_ensembl”).
Queries¶
Once the desired mart, dataset, attributes and filters have been explored (or if they were known beforehand), it is possible to query BioMart to retrieve the actual data; queries can be performed synchronously or asynchronously.
A full comparison of the two approaches is beyond the scope of this document. In short, with synchronous calls the client has to wait for each request to complete before moving on to the next one, whereas with asynchronous calls the client can start another request while the first one is waiting for a response, and so on until all requests have been sent and all responses returned.
Simply put, apybiomart allows users to perform synchronous queries to explore the data, and asynchronous queries to group multiple queries and run them efficiently.
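The difference can be illustrated without contacting BioMart at all. The following is only a sketch: fake_request is a hypothetical stand-in that sleeps instead of performing a network call, and the delays are arbitrary. Three requests of 0.2 seconds each would take about 0.6 seconds if run one after another, but finish in roughly the time of a single request when scheduled together.

```python
import asyncio
import time


async def fake_request(name, delay):
    """Hypothetical stand-in for a network request: wait `delay` seconds, return `name`."""
    await asyncio.sleep(delay)  # while this waits, the event loop runs the other tasks
    return name


async def run_concurrently():
    # The three requests are scheduled together, so their waiting times overlap.
    return await asyncio.gather(
        fake_request("first", 0.2),
        fake_request("second", 0.2),
        fake_request("third", 0.2),
    )


def timed_run():
    """Run the three requests and measure the total wall-clock time."""
    start = time.perf_counter()
    loop = asyncio.new_event_loop()
    try:
        results = loop.run_until_complete(run_concurrently())
    finally:
        loop.close()
    return results, time.perf_counter() - start


results, elapsed = timed_run()
print(results, "in", round(elapsed, 2), "seconds")
```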
Synchronous Queries¶
Synchronous queries can be performed using the query() function, which accepts attributes and filters arguments, and an optional dataset argument (which defaults to “hsapiens_gene_ensembl”):
from apybiomart import query
query(attributes=["ensembl_gene_id", "external_gene_name"],
      filters={"chromosome_name": "1"},
      dataset="hsapiens_gene_ensembl")
The attributes are provided as a list of properties, while filters are represented by a filter name : filter value dictionary. The returned dataframe contains the result of the query, restricted according to the provided filters and attributes.
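The type hints in the API section indicate that a filter value may be a str, int, list, tuple or bool. The sketch below builds the two arguments separately before passing them to query(); the run_query helper is hypothetical, and the reading that a list value selects rows matching any of the listed values is an assumption rather than something stated in this documentation.

```python
# Attributes are the columns to return, given as a list of attribute names.
attributes = ["ensembl_gene_id", "external_gene_name"]

# Filters map each filter name to the value(s) that the returned rows must match.
filters = {
    # Assumption: a list is taken to mean "any of these values".
    "chromosome_name": ["1", "2"],
}


def run_query():
    """Hypothetical helper; needs apybiomart installed and network access."""
    # Imported lazily, so the arguments above can be inspected offline.
    from apybiomart import query

    return query(attributes=attributes,
                 filters=filters,
                 dataset="hsapiens_gene_ensembl")
```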
Asynchronous Queries¶
Asynchronous queries can be performed using the aquery() function, which works just like query(), with the only difference that it is an async coroutine, so it needs to be run through the asyncio event loop:
import asyncio
from apybiomart import aquery
loop = asyncio.get_event_loop()
loop.run_until_complete(
    aquery(attributes=["ensembl_gene_id", "external_gene_name"],
           filters={"chromosome_name": "1"},
           dataset="hsapiens_gene_ensembl")
)
This makes it possible to group multiple queries together; the event loop will take care of scheduling them for execution:
import asyncio
from apybiomart import aquery
loop = asyncio.get_event_loop()
# one query per chromosome, for chromosomes 1, 2 and 3
tasks = [aquery(attributes=["ensembl_gene_id", "external_gene_name"],
                filters={"chromosome_name": str(i)},
                dataset="hsapiens_gene_ensembl") for i in range(1, 4)]
loop.run_until_complete(asyncio.gather(*tasks))
It is of course possible to assign the query results to one or more specific variables, for future usage:
# replacing last line of the previous code snippet
# a single variable holding the list of all results, one per query
single_result = loop.run_until_complete(asyncio.gather(*tasks))
# or using multiple variables, one per query
chrom1, chrom2, chrom3 = loop.run_until_complete(asyncio.gather(*tasks))
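Unpacking into separate variables works because asyncio.gather() returns results in the order in which the awaitables were passed, not the order in which they finish. The sketch below demonstrates this with fake_aquery, a hypothetical stand-in for aquery() that returns a label instead of a dataframe, and collects the results in a dictionary keyed by chromosome.

```python
import asyncio


async def fake_aquery(chromosome):
    """Hypothetical stand-in for aquery(): returns a label instead of a dataframe."""
    # Higher-numbered chromosomes finish first, to show that order is still preserved.
    await asyncio.sleep(0.01 * (4 - int(chromosome)))
    return "genes on chromosome " + chromosome


async def gather_by_chromosome(chromosomes):
    # gather() keeps the results in the same order as the awaitables it was given.
    results = await asyncio.gather(*(fake_aquery(c) for c in chromosomes))
    return dict(zip(chromosomes, results))


loop = asyncio.new_event_loop()
try:
    by_chrom = loop.run_until_complete(gather_by_chromosome(["1", "2", "3"]))
finally:
    loop.close()

print(by_chrom["2"])  # prints: genes on chromosome 2
```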
Please refer to the asyncio documentation for more information.
API¶
Python Module¶
Entry points¶
These functions are available after you import apybiomart and should be used as the main entry points for apybiomart. If you want more control, you can use the internal classes described below.
- async apybiomart.apybiomart.aquery(attributes: List[str], filters: Dict[str, Union[str, int, list, tuple, bool]], dataset: str = 'hsapiens_gene_ensembl', save: bool = False, output: str = 'apybiomart_aquery.csv') → pandas.core.frame.DataFrame
Launch asynchronous query using the given attributes, filters and dataset.
- Parameters
attributes – list of attributes to include
filters – dict of filter name : value to filter results
dataset – BioMart dataset name (default: “hsapiens_gene_ensembl”)
save – save results to a CSV file [default: False]
output – output filename if saving results [default: ‘apybiomart_aquery.csv’]
- apybiomart.apybiomart.find_attributes(dataset: str = 'hsapiens_gene_ensembl', save: bool = False, output: str = 'apybiomart_attributes.csv') → pandas.core.frame.DataFrame
Retrieve and list available attributes for a given dataset.
- Parameters
dataset – BioMart dataset name (default: “hsapiens_gene_ensembl”)
save – save results to a CSV file [default: False]
output – output filename if saving results [default: ‘apybiomart_attributes.csv’]
- apybiomart.apybiomart.find_datasets(mart: str = 'ENSEMBL_MART_ENSEMBL', save: bool = False, output: str = 'apybiomart_datasets.csv') → pandas.core.frame.DataFrame
Retrieve and list available datasets for a given mart.
- Parameters
mart – BioMart mart name (default: “ENSEMBL_MART_ENSEMBL”)
save – save results to a CSV file [default: False]
output – output filename if saving results [default: ‘apybiomart_datasets.csv’]
- apybiomart.apybiomart.find_filters(dataset: str = 'hsapiens_gene_ensembl', save: bool = False, output: str = 'apybiomart_filters.csv') → pandas.core.frame.DataFrame
Retrieve and list available filters for a given dataset.
- Parameters
dataset – BioMart dataset name (default: “hsapiens_gene_ensembl”)
save – save results to a CSV file [default: False]
output – output filename if saving results [default: ‘apybiomart_filters.csv’]
- apybiomart.apybiomart.find_marts(save: bool = False, output: str = 'apybiomart_marts.csv') → pandas.core.frame.DataFrame
Retrieve and list available marts.
- Parameters
save – save results to a CSV file [default: False]
output – output filename if saving results [default: ‘apybiomart_marts.csv’]
- apybiomart.apybiomart.query(attributes: List[str], filters: Dict[str, Union[str, int, list, tuple, bool]], dataset: str = 'hsapiens_gene_ensembl', save: bool = False, output: str = 'apybiomart_query.csv') → pandas.core.frame.DataFrame
Launch synchronous query using the given attributes, filters and dataset.
- Parameters
attributes – list of attributes to include
filters – dict of filter name : value to filter results
dataset – BioMart dataset name (default: “hsapiens_gene_ensembl”)
save – save results to a CSV file [default: False]
output – output filename if saving results [default: ‘apybiomart_query.csv’]
Internal classes¶
These are the internal classes used by apybiomart, and can be imported with from apybiomart.classes import <ClassName>. Use them if you want more control over the application.
- class apybiomart.classes.AttributesServer(dataset: str, save: bool = False, output: str = 'apybiomart_attributes.csv')
Class used to retrieve and list available attributes for a dataset.
  - dataset
  BioMart dataset name
- class apybiomart.classes.DatasetServer(mart: str, save: bool = False, output: str = 'apybiomart_datasets.csv')
Class used to retrieve and list available datasets for a mart.
  - mart
  BioMart mart name
- class apybiomart.classes.FiltersServer(dataset: str, save: bool = False, output: str = 'apybiomart_filters.csv')
Class used to retrieve and list available filters for a dataset.
  - dataset
  BioMart dataset name
- class apybiomart.classes.MartServer(save: bool = False, output: str = 'apybiomart_marts.csv')
Class used to retrieve and list available marts.
- class apybiomart.classes.Query(attributes: List[str], filters: Dict[str, Union[str, int, list, tuple, bool]], dataset: str, save: bool = False, output: str = 'apybiomart_query.csv')
Class used to perform either synchronous or asynchronous queries on BioMart.
  - attributes
  list of attributes to include
  - filters
  dict of filter name : value to filter results
  - dataset
  BioMart dataset name
Command Line Interface¶
Contributing¶
Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.
You can contribute in many ways:
Types of Contributions¶
Report Bugs¶
Report bugs at https://github.com/robertopreste/apybiomart/issues.
If you are reporting a bug, please include:
Your operating system name and version.
Any details about your local setup that might be helpful in troubleshooting.
Detailed steps to reproduce the bug.
Fix Bugs¶
Look through the GitHub issues for bugs. Anything tagged with “bug” and “help wanted” is open to whoever wants to implement it.
Implement Features¶
Look through the GitHub issues for features. Anything tagged with “enhancement” and “help wanted” is open to whoever wants to implement it.
Write Documentation¶
apybiomart could always use more documentation, whether as part of the official apybiomart docs, in docstrings, or even on the web in blog posts, articles, and such.
Submit Feedback¶
The best way to send feedback is to file an issue at https://github.com/robertopreste/apybiomart/issues.
If you are proposing a feature:
Explain in detail how it would work.
Keep the scope as narrow as possible, to make it easier to implement.
Remember that this is a volunteer-driven project, and that contributions are welcome :)
Get Started!¶
Ready to contribute? Here’s how to set up apybiomart for local development.
Fork the apybiomart repo on GitHub.
Clone your fork locally:
$ git clone git@github.com:your_name_here/apybiomart.git
Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:
$ mkvirtualenv apybiomart
$ cd apybiomart/
$ python setup.py develop
Create a branch for local development:
$ git checkout -b name-of-your-bugfix-or-feature
Now you can make your changes locally.
When you’re done making changes, check that your changes pass flake8 and the tests, including testing other Python versions with tox:
$ flake8 apybiomart tests
$ python setup.py test  # or pytest
$ tox
To get flake8 and tox, just pip install them into your virtualenv.
Commit your changes and push your branch to GitHub:
$ git add .
$ git commit -m "Your detailed description of your changes."
$ git push origin name-of-your-bugfix-or-feature
Submit a pull request through the GitHub website.
Pull Request Guidelines¶
Before you submit a pull request, check that it meets these guidelines:
The pull request should include tests.
If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in README.rst.
The pull request should work for Python 3.5 and 3.6, and for PyPy. Check https://travis-ci.org/robertopreste/apybiomart/pull_requests and make sure that the tests pass for all supported Python versions.
Tips¶
To run a subset of tests:
$ pytest tests/test_apybiomart.py
Deploying¶
A reminder for the maintainers on how to deploy. Make sure all your changes are committed (including an entry in HISTORY.rst). Then run:
$ bump2version patch # possible: major / minor / patch
$ git push
$ git push --tags
WIP: Travis will then deploy to PyPI if tests pass.
Credits¶
Development Lead¶
Roberto Preste <robertopreste@gmail.com>
Contributors¶
None yet. Why not be the first?
History¶
0.1.0 (2019-03-26)¶
First development release.
0.1.1 (2019-03-27)¶
Requests are converted to async calls;
Code style is clean and Python 3 compatible.
0.1.2 (2019-03-27)¶
Add basic tests.
0.2.0 (2019-03-31)¶
New version with different organisation of classes and functions;
Sync query and async aquery functions to query Biomart;
Sync list_* functions to retrieve available marts, datasets, filters and attributes.
0.2.1 (2019-04-01)¶
Add tests.
0.2.2 (2019-04-01)¶
Basic functions working and tested;
Fix documentation;
Update requirements.
0.2.3 (2019-04-02)¶
Update requirements;
Fix type hints for query functions;
Reorganise query classes into a single class;
Update documentation.
0.2.4 (2019-04-04)¶
Fix type hints;
Fix docstrings in classes;
Add docstrings to main entry points.
0.2.5 (2019-04-09)¶
Fix test files with new BioMart versions;
Add script to create test files automatically.
0.2.6 (2019-04-29)¶
Update test files;
Fix and update documentation.
0.3.0 (2019-05-05)¶
Change list_* function names to find_* for better compliance;
Update documentation.
0.3.1 (2019-05-11)¶
Fix requirements handling;
Add function to check internet connection.
0.3.2 (2019-05-29)¶
Correct minor typos;
Update documentation and testfiles.
0.3.3 (2019-07-29)¶
Fix #37 - issue with the requests module not installed.
0.3.4 (2019-08-23)¶
Better handling of filters arguments for query() and aquery() functions;
Convert docstrings to Google style;
Fix documentation.
0.3.5 (2019-08-25)¶
Relax requirement versions.
0.4.0 (2020-01-26)¶
Add CLI commands for finding marts, datasets, attributes and filters;
Change output dataframe column names.
0.5.0 (2020-03-22)¶
Add CLI and Python module options to save outputs to CSV file.
0.5.1 (2020-04-04)¶
Add option to specify the output CSV filename.
0.5.2 (2020-06-06)¶
Update tests and test files;
Clean code;
Add CI module.
0.5.3 (2020-11-30)¶
Remove deprecated pd.np occurrences;
Update test files.