Sunday, January 23, 2011

Bioinformatics, Biocomputing and Perl






Contents
Preface xv
1 Setting the Biological Scene 1
1.1 Introducing Biological Sequence Analysis 1
1.2 Protein and Polypeptides 4
1.3 Generalised Models and their Use 5
1.4 The Central Dogma of Molecular Biology 6
1.4.1 Transcription 6
1.4.2 Translation 7
1.5 Genome Sequencing 10
1.5.1 Sequence assembly 11
1.6 The Example DNA-gene-protein system we will use 12
Where to from Here 13
2 Setting the Technological Scene 15
2.1 The Layers of Technology 15
2.1.1 From passive user to active developer 16
2.2 Finding perl 17
2.2.1 Checking for perl 17
Where to from Here 18
I Working with Perl 19
3 The Basics 21
3.1 Let’s Get Started! 21
3.1.1 Running Perl programs 22
3.1.2 Syntax and semantics 23
3.1.3 Program: run thyself! 25
3.2 Iteration 26
3.2.1 Using the Perl while construct 26
3.3 More Iterations 30
3.3.1 Introducing variable containers 31
3.3.2 Variable containers and loops 32
3.4 Selection 34
3.4.1 Using the Perl if construct 35
3.5 There Really is MTOWTDI 36
3.6 Processing Data Files 41
3.6.1 Asking getlines to do more 43
3.7 Introducing Patterns 44
Where to from Here 46
The Maxims Repeated 46
4 Places to Put Things 49
4.1 Beyond Scalars 49
4.2 Arrays: Associating Data with Numbers 49
4.2.1 Working with array elements 51
4.2.2 How big is the array? 51
4.2.3 Adding elements to an array 52
4.2.4 Removing elements from an array 54
4.2.5 Slicing arrays 54
4.2.6 Pushing, popping, shifting and unshifting 56
4.2.7 Processing every element in an array 57
4.2.8 Making lists easier to work with 59
4.3 Hashes: Associating Data with Words 60
4.3.1 Working with hash entries 61
4.3.2 How big is the hash? 61
4.3.3 Adding entries to a hash 62
4.3.4 Removing entries from a hash 62
4.3.5 Slicing hashes 63
4.3.6 Working with hash entries: a complete example 64
4.3.7 Processing every entry in a hash 66
Where to from Here 68
The Maxims Repeated 68
5 Getting Organised 71
5.1 Named Blocks 71
5.2 Introducing Subroutines 73
5.2.1 Calling subroutines 73
5.3 Creating Subroutines 74
5.3.1 Processing parameters 76
5.3.2 Better processing of parameters 78
5.3.3 Even better processing of parameters 80
5.3.4 A more flexible drawline subroutine 83
5.3.5 Returning results 84
5.4 Visibility and Scope 85
5.4.1 Using private variables 86
5.4.2 Using global variables properly 88
5.4.3 The final version of drawline 89
5.5 In-built Subroutines 90
5.6 Grouping and Reusing Subroutines 92
5.6.1 Modules 93
5.7 The Standard Modules 96
5.8 CPAN: The Module Repository 96
5.8.1 Searching CPAN 97
5.8.2 Installing a CPAN module manually 98
5.8.3 Installing a CPAN module automatically 99
5.8.4 A final word on CPAN modules 99
Where to from Here 100
The Maxims Repeated 100
6 About Files 103
6.1 I/O: Input and Output 103
6.1.1 The standard streams: STDIN, STDOUT and STDERR 103
6.2 Reading Files 105
6.2.1 Determining the disk-file names 106
6.2.2 Opening the named disk-files 108
6.2.3 Reading a line from each of the disk-files 110
6.2.4 Putting it all together 110
6.2.5 Slurping 114
6.3 Writing Files 116
6.3.1 Redirecting output 117
6.3.2 Variable interpolation 117
6.4 Chopping and Chomping 118
Where to from Here 119
The Maxims Repeated 119
7 Patterns, Patterns and More Patterns 121
7.1 Pattern Basics 121
7.1.1 What is a regular expression? 122
7.1.2 What makes regular expressions so special? 122
7.2 Introducing the Pattern Metacharacters 124
7.2.1 The + repetition metacharacter 124
7.2.2 The | alternation metacharacter 126
7.2.3 Metacharacter shorthand and character classes 127
7.2.4 More metacharacter shorthand 128
7.2.5 More repetition 130
7.2.6 The ? and * optional metacharacters 130
7.2.7 The any character metacharacter 131
7.3 Anchors 132
7.3.1 The \b word boundary metacharacter 132
7.3.2 The ^ start-of-line metacharacter 133
7.3.3 The $ end-of-line metacharacter 133
7.4 The Binding Operators 134
7.5 Remembering What Was Matched 135
7.6 Greedy by Default 137
7.7 Alternative Pattern Delimiters 138
7.8 Another Useful Utility 139
7.9 Substitutions: Search and Replace 140
7.9.1 Substituting for whitespace 141
7.10 Finding a Sequence 142
Where to from Here 146
The Maxims Repeated 146
8 Perl Grabbag 147
8.1 Introduction 147
8.2 Strictness 147
8.3 Perl One-liners 149
8.4 Running Other Programs from perl 152
8.5 Recovering from Errors 153
8.6 Sorting 155
8.7 HERE Documents 159
Where to from Here 160
The Maxims Repeated 161
II Working with Data 163
9 Downloading Datasets 165
9.1 Let’s Get Data 165
9.2 Downloading from the Web 165
9.2.1 Using wget to download PDB data-files 167
9.2.2 Mirroring a dataset 168
9.2.3 Smarter mirroring 168
9.2.4 Downloading a subset of a dataset 169
Where to from Here 171
The Maxims Repeated 171
10 The Protein Databank 173
10.1 Introduction 173
10.2 Determining Biomolecule Structures 174
10.2.1 X-Ray Crystallography 174
10.2.2 Nuclear magnetic resonance 176
10.2.3 Summary of protein structure methods 177
10.3 The Protein Databank 177
10.4 The PDB Data-file Formats 179
10.4.1 Example structures 180
10.4.2 Downloading PDB data-files 181
10.5 Accessing Data in PDB Entries 182
10.6 Accessing PDB Annotation Data 183
10.6.1 Free R and resolution 184
10.6.2 Database cross references 186
10.6.3 Coordinates section 188
10.6.4 Extracting 3D coordinate data 191
10.7 Contact Maps 192
10.8 STRIDE: Secondary Structure Assignment 196
10.8.1 Installation of STRIDE 197
10.9 Assigning Secondary Structures 197
10.9.1 Using STRIDE and parsing the output 200
10.9.2 Extracting amino acid sequences using STRIDE 204
10.10 Introducing the mmCIF Protein Format 205
10.10.1 Converting mmCIF to PDB 206
10.10.2 Converting mmCIFs to PDB with CIFTr 206
10.10.3 Problems with the CIFTr conversion 208
10.10.4 Some advice on using mmCIF 208
10.10.5 Automated conversion of mmCIF to PDB 208
Where to from Here 210
The Maxims Repeated 210
11 Non-redundant Datasets 211
11.1 Introducing Non-redundant Datasets 211
11.1.1 Reasons for redundancy 211
11.1.2 Reduction of redundancy 212
11.1.3 Non-redundancy and non-representative 212
11.2 Non-redundant Protein Structures 213
Where to from Here 217
The Maxims Repeated 217
12 Databases 219
12.1 Introducing Databases 219
12.1.1 Relating tables 220
12.1.2 The problem with single-table databases 222
12.1.3 Solving the one-table problem 222
12.1.4 Database system: a definition 224
12.2 Available Database Systems 224
12.2.1 Personal database systems 225
12.2.2 Enterprise database systems 225
12.2.3 Open source database systems 225
12.3 SQL: the Language of Databases 226
12.3.1 Defining data with SQL 226
12.3.2 Manipulating data with SQL 227
12.4 A Database Case Study: MER 227
12.4.1 The requirement for the MER database 231
12.4.2 Installing a database system 232
12.4.3 Creating the MER database 233
12.4.4 Adding tables to the MER database 235
12.4.5 Preparing SWISS-PROT data for importation 238
12.4.6 Importing tab-delimited data into proteins 245
12.4.7 Working with the data in proteins 246
12.4.8 Adding another table to the MER database 248
12.4.9 Preparing EMBL data for importation 249
12.4.10 Importing tab-delimited data into dnas 253
12.4.11 Working with the data in dnas 253
12.4.12 Relating data in one table to that in another 254
12.4.13 Adding the crossrefs table to the MER database 255
12.4.14 Preparing cross references for importation 256
12.4.15 Importing tab-delimited data into crossrefs 259
12.4.16 Working with the data in crossrefs 259
12.4.17 Adding the citations table to the MER database 263
12.4.18 Preparing citation information for importation 265
12.4.19 Importing tab-delimited data into citations 268
12.4.20 Working with the data in citations 268
Where to from Here 269
The Maxims Repeated 269
13 Databases and Perl 273
13.1 Why Program Databases? 273
13.2 Perl Database Technologies 274
13.3 Preparing Perl 275
13.3.1 Checking the DBI installation 275
13.4 Programming Databases with DBI 276
13.4.1 Developing a database utility module 279
13.4.2 Improving upon dump results 280
13.5 Customising Output 282
13.6 Customising Input 285
13.7 Extending SQL 289
Where to from Here 292
The Maxims Repeated 292
III Working with the Web 295
14 The Sequence Retrieval System 297
14.1 An Example of What’s Possible 297
14.2 Why SRS? 298
14.3 Using SRS 298
Where to from Here 300
The Maxims Repeated 300
15 Web Technologies 303
15.1 The Web Development Infrastructure 303
15.2 Creating Content for the WWW 305
15.2.1 The static creation of WWW content 308
15.2.2 The dynamic creation of WWW content 308
15.3 Preparing Apache for Perl 310
15.3.1 Testing the execution of server-side programs 312
15.4 Sending Data to a Web Server 315
15.5 Web Databases 320
Where to from Here 327
The Maxims Repeated 327
16 Web Automation 329
16.1 Why Automate Surfing? 329
16.2 Automated Surfing with Perl 330
Where to from Here 335
The Maxims Repeated 336
IV Working with Applications 337
17 Tools and Datasets 339
17.1 Introduction 339
17.2 Sequence Databases 340
17.2.1 Understanding EMBL entries 343
17.2.2 Understanding SWISS-PROT entries 346
17.2.3 Summarising sequences databases 347
17.3 General Concepts and Methods 347
17.3.1 Predictions and validation 348
17.3.2 True/False/Negative/Positive 348
17.3.3 Balancing the errors 351
17.3.4 Using multiple algorithms to improve performance 352
17.3.5 tRNA-ScanSE, a case study 353
17.4 Introducing Bioinformatics Tools 357
17.4.1 ClustalW 358
17.4.2 Algorithms and methods 359
17.4.3 Installation and use 360
17.4.4 Substitution/scoring matrices 361
17.5 BLAST 362
17.5.1 Installing NCBI-BLAST 364
17.5.2 Preparation of database files for faster searching 365
17.5.3 The different types of BLAST search 369
17.5.4 Final words on BLAST 371
Where to from Here 371
The Maxims Repeated 371
18 Applications 373
18.1 Introduction 373
18.2 Scientific Background to Mer Operon 374
18.2.1 Function 374
18.2.2 Genetic structure and regulation 374
18.2.3 Mobility of the Mer Operon 375
18.3 Downloading the Raw DNA Sequence 377
18.4 Initial BLAST Sequence Similarity Search 378
18.5 GeneMark 380
18.5.1 Using BLAST to identify specific sequences 382
18.5.2 Dealing with false negatives and missing proteins 386
18.5.3 Over-predicted genes and false positives 387
18.5.4 Summary of validation of GeneMark prediction 388
18.6 Structural Prediction with SWISS-MODEL 388
18.6.1 Alternatives to homology modelling 390
18.6.2 Modelling with SWISS-MODEL 390
18.7 DeepView as a Structural Alignment Tool 396
18.8 PROSITE and Sequence Motifs 401
18.8.1 Using PROSITE patterns and matrices 402
18.8.2 Downloading PROSITE and its search tools 403
18.8.3 Final word on PROSITE 407
18.9 Phylogenetics 407
18.9.1 A look at the HMA domain of MerA and MerP 407
Where to from Here 410
The Maxims Repeated 411
19 Data Visualisation 413
19.1 Introducing Visualisation 413
19.2 Displaying Tabular Data Using HTML 415
19.2.1 Displaying SWISS-PROT identifiers 417
19.3 Creating High-quality Graphics with GD 422
19.3.1 Using the GD module 424
19.3.2 Displaying genes in EMBL entries 426
19.3.3 Introducing mogrify 429
19.4 Plotting Graphs 431
19.4.1 Graph-plotting using the GD::Graph modules 432
19.4.2 Graph-plotting using Grace 433
Where to from Here 439
The Maxims Repeated 439
20 Introducing Bioperl 441
20.1 What is Bioperl? 441
20.2 Bioperl’s Relationship to Project Ensembl 442
20.3 Installing Bioperl 442
20.4 Using Bioperl: Fetching Sequences 444
20.4.1 Fetching multiple sequences 445
20.4.2 Extracting sub-sequences 447
20.5 Remote BLAST Searches 448
20.5.1 A quick aside: the blastcl3 NetBlast client 449
20.5.2 Parsing BLAST outputs 450
Where to from Here 451
The Maxims Repeated 452
A Appendix A 453
B Appendix B 457
C Appendix C 459
D Appendix D 461
E Appendix E 467
F Appendix F 471
Index 475


