Articles from Demo

Learn To Use The VO

2024-08-14 Markus Demleitner

The first 60 pages of the lecture notes as they currently are. I give you a modern textbook would probably look a bit more colorful from this distance, but perhaps this will still do.

About ten years ago, I had planned to write something I tentatively called VadeVOcum: A guide for people wanting to use the Virtual Observatory somewhat more creatively than just following and slightly adapting tutorials and use cases. If you will, I had planned to write a textbook on the VO.

For all the usual reasons, that project never went far. Meanwhile, however, GAVO's courses on ADQL and on pyVO grew and matured. When, some time in 2021, I was asked whether I could give a semester-long course “on the VO”, I figured that would be a good opportunity to finally make the pyVO course publishable and complement the two short courses with enough framing that some coherent story would emerge, close enough to the VO textbook I had in mind in about 2012.
Teaching Virtual Observatory Matters

The result was a course I taught at Universität Heidelberg in the past summer semester together with Hendrik Heinl and Joachim Wambsganss. I have now published the lecture notes, which I hope are textbooky enough that they work for self-study, too. But of course I would be honoured if the material were used as a basis of similar courses in other places. To make this simpler, the sources are available on Codeberg without relevant legal restrictions (i.e., under CC0).

The course currently comprises thirteen “lectures”. These are designed so I can present them within something like 90 minutes, leaving a bit of space for questions, contingencies, and the side tracks. You can build the slides for each of these lectures separately (see the .pres files in the source repository), which makes the PDF to work while teaching less cumbersome. In addition to that main trail, there are seven “side tracks”, which cover more fundamental or more general topics.

In practice, I sprinkled in the side tracks when I had some time left. For instance, I showed the VOTable side track at the ends of the ADQL 2 and ADQL 3 lectures; but that really had no didactic reason, it was just about filling time. It seemed the students did not mind the topic switches to much. Still, I wonder if I should not bring at least some of the side tracks, like those on UCDs, identifiers, and vocabularies, into the main trail, as it would be unfortunate if their content fell through the cracks.

Here is a commented table of contents:
- Introduction: What is the VO and why should you care? (including a first demo)
- Simple Protocols and their clients (which is about SIAP, SSAP, and SCS, as well as about TOPCAT and Aladin)
- TAP and ADQL (that's typically three lectures going from the first SELECT to complex joins involving subqueries)
- Interlude: HEALPix, MOC, HiPS (this would probably be where a few of the other side tracks might land, too)
- pyVO Basics (using XService objects and a bit of SAMP, mainly along an image discovery task)
- pyVO and TAP (which is developed around a multi-catalogue SED building case)
- pyVO and the Registry (which, in contrast to the rest of the course, is employing Jupyter notebooks because much of the Registry API makes sense mainly in interactive use)
- Datalink (giving a few pyVO examples for doing interesting things with the protocol)
- Higher SAMP Magic (also introducing a bit of object oriented programming, this is mainly about tool building)
- At the Limit: VO-Wide TAP Queries (cross-server TAP queries with query building, feature sensing and all that jazz; I admit this is fairly scary and, well, at the limit of what you'd want to show publicly)
- Odds and Ends (other pyVO topics that don't warrant a full section)
- Side Track: Terminology (client, server, dataset, data collection, oh my; I had expected this to grow more than it actually did)
- Side Track: Architecture (a deeper look at why we bother with standards)
- Side Track: Standards (a very brief overview of what standards the IVOA has produced, with a view of guiding users away from the ones they should not bother with – and perhaps towards those they may want to read after all)
- Side Track: UCDs (including hints on how to figure out which would denote a concept one is interested in)
- Side Track: Vocabularies (I had some doubts whether that is too much detail, but while updating the course I realised that vocabularies are now really user-visible in several places)
- Side Track: VOTable (with the intention of giving people enough confidence to perform emergency surgery on VOTables)
- Side Track: IVOA Identifiers (trying to explain the various ivo:// URIs users might see).
Pitfalls: Technical, Intellectual, and Spiritual

The course was accompanied by lab work, again 90 minutes a week. There are a few dozen exercises embedded in the course, and in the lab sessions we worked on some suitable subset of those. With the particular students I had and the lack of grading pressure, the fact that solutions for most of the exercises come with the lecture notes did not turn out to be a problem.

The plan was that the students would explain their solutions and, more importantly, the places they got stuck in to their peers. This worked reasonably well in the ADQL part, somewhat less for the side tracks, and regrettably a lot less well in the pyVO part of the course. I cannot say I have clear lessons to be learned from that yet.

A piece of trouble for the student-generated parts I had not expected was that the projector only interoperated with rather few of the machines the students brought. Coupling computers and projectors was occasionally difficult even in the age of universal VGA. These days, even in the unlikely event one has an adapter for the connectors on the students' computers, there is no telling what part of a computer screen will end up on the wall, which distortions and artefacts will be present and how much the whole thing will flicker.

Oh, and better forget about trying to fix things by lowering the resolution or the refresh rate or whatever: I have not had one instance during the course in which any plausible action on the side of the computer improved the projected image. Welcome to the world of digital video signals. Next time around, I think I will bring a demonstration computer and figure out a way in which the students can quickly transfer their work there.

Talking about unexpected technical hurdles: I am employing PDF-attached source code quite extensively in the course, and it turned out that quite a few PDF clients in use no longer do something reasonable with that. With pdf.js, I see why that would be, and it's one extra reason to want to avoid it. But even desktop readers behaved erratically, including some Windows PDF reader that had the .py extension on some sort of blacklist and refused to store the attached files on grounds that they may “damage the computer”. Ah well. I was tempted to have a side track on version control with git when writing the course. This experience is probably an encouragement to follow through with that and at least for the pyVO part to tell students to pull the files out of a checkout of the course's source code.

Against the outline in the lecture as given, I have now promoted the former HEALPix side track to an interlude session, going between ADQL and pyVO. It logically fits there, and it was rather popular with the students. I have also moved the SAMP magic lecture to a later spot in the course; while I am still convinced it is a cool use case, and giving students a chance to get to like classes is worthwhile, too, it seems to be too much tool building to have much appeal to the average participant.

Expectably, when doing live VO work I regularly had interesting embarrassments. For instance, in the pyvo-tap lecture, where we do something like primitive SEDs from three catalogues (SDSS, 2MASS and WISE), the optical part of the SEDs was suddenly gone in the lecture and I really wondered what I had broken. After poking at things for longer than I should have, I eventually promised to debug after class and report next time, only to notice right after the lecture that I had, to make some now-forgotten point, changed the search position – and had simply left the SDSS footprint.

But I believe that was actually a good thing, because showing actual errors (it does not hurt if they are inadvertent) and at least brief attempts to understand them (and, possibly later, explain how one actually understood them) is a valuable part of any sort of (IT-related) education. Far too few people routinely attempt to understand what a computer is trying to tell them when it shows a message – at their peril.

Reruns, House Calls, TV Shows

Of course, there is a lot more one could say about the VO, even when mainly addressing users (as opposed to adopters). An obvious addition will be a lecture on the global dataset discovery API I have recently discussed here, and I plan to write it when the corresponding code will be in a pyVO release. I am also tempted to have something on stilts, perhaps in a side track. For instance, with a view to students going on to do tool development, in particular stilts' validators would deserve a few words.

That said, and although I still did quite a bit of editing based on my experiences while teaching, I believe the material is by and large sound and up-to-date now. As I said: everyone is welcome to the material for tinkering and adoption. Hendrik and I are also open to give standalone courses on ADQL (about a day) or pyVO (two to three days) at astronomical institutes in Germany or elsewhere in not-too remote Europe as long as you house (one of) us. The complete course could be a 10-days block, but I don't think I can be booked with that[1].

Another option would be a remote-teaching version of the course. Hendrik and I have discussed whether we have the inclination and the resources to make that happen, and if you believe something like that might fit into your curriculum, please also drop us a note.

And of course we welcome all sorts of bug reports and pull requests on codeberg, first and foremost from people using the material to spread the VO gospel.

[1] Well… let me hedge that I don't think I'd find a no in myself if the course took place on the Canary Islands…

Category: Demo
Updates to GAVO's Tutorials

2023-03-23 Markus Demleitner

Over the years, GAVO has produced a number of VO tutorials, i.e., texts that introduce some technique related to using the Virtual Observatory, preferably within some halfway plausible scenario. In effect, they are software documentation, and as software itself, software documentation suffers from bit rot. To work against that, the tutorials have to be revised occasionally.

My two student assistants Sonja Gabriel and Chuanming Mao have recently done some of that revising. Let me use this opportunity to show off some of these freshly polished tutorials.

A classic one (that has, if I may say so myself, aged rather well), is Adding catalog data to object lists using the VO. This is a thinly disguised introduction to TAP uploads, arguably the most powerful of all the VO tech to date. If you have come to this place without ever having done a TAP upload, you owe it to yourself to at least skim the tutorial and quickly follow along the few steps to do positional crossmatches with just about any astronomical catalog and with just about any level of sophistication.

Another classic – it has its roots in the original Italian VO Days[1] – is TOPCAT and Aladin working together. It is using SDSS data of some galaxy cluster to try and get you to to send around data and positions between different programs using SAMP. If you are reading VO blogs, it is not unlikely this kind of thing will make you yawn. But at VO Days, it's little things like this that usually most immediately appeal to students and researchers alike.

From a tech point of view, Explore the Pleiades with TOPCAT and Aladin also mainly looks at SAMP (perhaps even somewhat less convincingly), but it's such a striking demo of what an amazing instrument Gaia is, and it's a nice introduction to TOPCAT's VO interface and subsetting facility that it's definitely worth a look, in particular as a showcase of having instant results with the VO.

An entirely different topic (well: it also employs SAMP for a moment) is covered by Data Discovery Using the Virtual Observatory Registry. This is trying to motivate looking for data collections in the VO Registry (in the form of our Browser interface to it). This tutorial has grown quite a bit during the review and now includes two sections joining data from different resources for various purposes. One section illustrates how systematics of quasar redshifts might be looked into using different sources, the other investigates the Tully-Fisher relationship in different spectral bands.

The tutorial on Asteroids in the Solar System was entriely overhauled. It was (and still is) mainly intended to be used in schools, and thus it originally just built on things that ran in a web browser. As is typical of things in web browsers, they have long since vanished. Hence, a rather fundamental update was necessary anyway. While we were looking for interesting things to do – the plot above, by the way, is the distribution of major halfaxes in the Kuiper belt –, we ended up even includeding a brief bit on ADQL.

Due to its school focus, we are also offering this particular text in German as well as in English. If you are an Astronomy teacher with particularly motivated pupils , we would like to hear from you…

The last revised tutorial I would like to mention also has a somewhat special (main) target audience: Astrometric Calibration using Aladin. Admittedly, automatic, or “blind” calibration has become really great, and I think getting their images located on the sky is not much of a problem even for amateurs any more, thanks in part to services like astrometry.net. But then – sometimes there is nothing like a good, old manual, ummm, “plate” solution. Aladin and the VO make that lot less tedious than it used to be.

Of course, I cannot have a post on tutorials without mentioning the VO Text Treasures, a web page that shows the educational material currently registered in the VO Registry. This little page also accounts for bit rot: You can sort by the time last inspected there, and thanks to Sonja's and Chuanming's efforts, our tutorials look very good in that representation at the moment.

In case you have some material suitable for WIRR yourself: Please register it, too. Send me a mail and I will lend you a hand (or, if you are a VO pro, directly read the pertinent standard).

[1] That's block courses on VO matters lasting a day or two. If you are in Germany, you can book us for your very own one!

Category: Demo

HEALPix Maps: In General and in Gaia

2022-12-22 Markus Demleitner

blue and reddish pixels drawing a bar on the sky.

A map of average Gaia colours in HEALPixes 2/83 and 2/86 (Orion south-east). This post tells you how to (relatively) quickly produce such maps.

Making HEALPix maps with Gaia source_ids
HEALPix to Screen Pixel
Positional Constraints using source_ids
Aggregating over a Non-HEALPix

This year's puzzler for the AG Tagung turned out to be a valuable source of interesting ADQL queries. I have already written about finding dusty spots on the sky, and in the puzzler solution, I had promised some words on creating dust maps, or, more generally, HEALPix maps of any sort.

Making HEALPix maps with Gaia source_ids

The basic technique is explained in Mark Taylor's classical ADASS poster from 2016. On GAVO's TAP service (access URL http://dc.g-vo.org/tap), you will also find an example for that (in TOPCAT's TAP window, check the Service-provided section unter the Examples button for it). However, once you have Gaia source_ids, there is something a lot faster and arguably not much less convenient. Let me quote the footnote on source_id from my DR3 lite table:

For the contents of Gaia DR3, the source ID consists of a 64-bit integer, least significant bit = 1 and most significant bit = 64, comprising:

a HEALPix index number (sky pixel) in bits 36 - 63; by definition the smallest HEALPix index number is zero.

[…]

This means that the HEALpix index level 12 of a given source is contained in the most significant bits. HEALpix index of 12 and lower levels can thus be retrieved as follows:

[...]

HEALpix [at] level n = (source_id)/(2³⁵⋅4^12 − n).

That is: Once you have a Gaia source_id, you an compute HEALpix indexes on levels 12 or less by a simple integer division! I give you that the more-than-35-bit numbers you have to divide by do look a bit scary – but you can always come back here for cutting and pasting:

HEALPix level	Integer-divide source id by
12	34359738368
11	137438953472
10	549755813888
9	2199023255552
8	8796093022208
7	35184372088832
6	140737488355328
5	562949953421312
4	2251799813685248
3	9007199254740992
2	36028797018963968

If you know – and that is very valuable knowledge far beyond this particular application – that you can simply jump between HEALPix indexes of different levels by multiplying with or integer-dividing by four, the general formula in the footnote actually becomes rather memorisable. Let me illustrate that with an example in Python. HEALPix number 3145 on level 6 is:

>>> 3145//4  # ...within this HEALPix on level 5...
786
>>> 3145*4, (3145+1)*4  # ..and covers these on level 7...
(12580, 12584)

Simple but ingenious.

You can immediately exploit this to make HEALPix maps like the one in the puzzler. This piece of ADQL does the job within a few seconds on the GAVO DC TAP service[1]:

SELECT source_id/8796093022208 AS pix,
  AVG(phot_bp_mean_mag-phot_rp_mean_mag) AS avgcol
FROM gaia.edr3lite
WHERE distance(ra, dec, 246.7, -24.5)<2
GROUP by pix

Using the table above, you see that the horrendous 8796093022208 is the code for HEALPix level 8. When you remember (and you should) that HEALPix level 6 corresponds to a linear dimension of about 1 degree and each level is a factor of two in linear dimension, you see that the map ought to have a resolution of about 1/8th of a degree.

HEALPix to Screen Pixel

How do you plot this? Well, in TOPCAT, do Graphics → Sky Plot, and then in the plot window Layers → Add HEALPix control (there are icons for both of these, too). You then have to manually configure the plot for the table you just retrieved: Set the Level to 8, the index to pix and the Value to avgcol – we're working on making the annotation a bit richer so that TOPCAT has a chance to figure this out by itself.

With a bit of extra configuration, you get the following map of average colours (really: dust concentration):

Plot: Black and reddish pixels showing a bit of structure

This is not totally ideal, as at the border of the cone, certain Healpixes are only partially covered, which makes statistics unnecessarily harder.

Positional Constraints using source_ids

Due to Gaia's brilliant numbering scheme, we can do analysis by HEALpix, too, circumventing (among other things) this problem. Say you are interested in the vicinity of the M42 and would like to investigate a patch of about 8 degrees. By our rule of thumb, 8 degrees is three levels up from the one-degree level 6. To find the corresponding HEALpix index, on DaCHS servers with their gavo_simbadpoint UDF you could say:

SELECT TOP 1 ivo_healpix_index(3, gavo_simbadpoint('M42'))
FROM tap_schema.tables

Hu, you ask, what's tap_schema.tables to do with this? Well, nothing, really. It's just that ADQL's syntax requires selecting from a table, even if what we select is completely independent of any table, as for instance the index of M42's 3-HEALpix. The hack above picks in a table guaranteed to exist on all TAP services, and the TOP 1 makes sure we only compute the value once. In case you ever feel the need to abuse a TAP service as a calculator: Keep this trick in mind.

The result, 334, you could also have found more graphically, as follows:

Start Aladin
Check Overlay → HEALPix grid
Enter M42 in Command
Zoom out until you see HEALPix indexes of level 3 in the grid.

An advantage you have with this method: You see that M42 happens to lie on a border of HEALPixes; perhaps you should include all of 334, 335, 356, and 357 if you were really interested in the Orion Nebula's vicinity.

We, on the other hand, are just interested in instructive examples, and hence let's just repeat our colour mapping with all Gaia objects from HEALPix 3/334. How do you select these? Well, by source_id's construction, you know their source_ids will be between 334⋅9007199254740992 and (334 + 1)⋅9007199254740992 − 1:

SELECT source_id/8796093022208 AS pix,
  AVG(phot_bp_mean_mag-phot_rp_mean_mag) AS avgcol
FROM gaia.edr3lite
WHERE source_id BETWEEN 334*9007199254740992 AND 335*9007199254740992-1
GROUP by pix

This is computationally cheap (though Postgres, not being a column store still has to do quite a bit of I/O; note how much faster this query is when you run it again and all the tuples are already in memory). Even going to HEALPix level 2 would in general still be within our sync time limit. The opening figure was produced with the constraint:

source_id BETWEEN 83*36028797018963968 AND 84*36028797018963968-1
OR source_id BETWEEN 86*36028797018963968 AND 87*36028797018963968-1

– and with a sync query.

Aggregating over a Non-HEALPix

One last point: The constraints we have just been using are, in effect, positional constraints. You can also use them as quick and in some sense rather unbiased sampling tools.

For instance, if you would like so see how the reddening in one of the “dense“ spots in the opening picture behaves with distance, you could first pick a point – α = 98, δ = 4, say –, then convert that to a level 7 healpix as above (that's/88974) and then write:

SELECT ROUND(r_med_photogeo/200)*200 AS distbin, COUNT(*) as n,
    AVG(phot_bp_mean_mag-phot_rp_mean_mag) AS avgcol
FROM gaia.dr3lite
JOIN gedr3dist.main USING (source_id)
WHERE source_id BETWEEN 88974*35184372088832 and 88975*35184372088832-1
GROUP BY distbin

This is creating 200 pc bins in distance based on the estimates in the gedr3dist.main table (note that this adds subtle correlations, because these estimates already contain Gaia colour information). Since quite a few of these bins will be very sparsely populated, I'm also fetching the number of objects contributing. And then I plot the whole thing, using the conventional √(n) ⁄ n as a rough estimate for the relative error:

Plot: A line that first slowly declines, then rises quite a bit, then flattens out and becomes crazy as errors start to dominate.

This plot immediatly shows that colour systematics are not exclusively due to dust, as in that case things would only get redder all the time. The blueward trend up to 700 pc is reasonably well explained by the brighter, bluer upper main sequence becoming more dominant in the population sampled as red dwarfs become too faint for Gaia.

The strong reddening setting in after that is rather certainly due to the Orion complex, though I would perhaps not have expected it to reach out to 2 kpc (the conventional distance to M42 is about 0.5 kpc); without having properly thought about it, I'll chalk it off as “the Orion arm“. And after that, it's again what I'd call Malmquist-blueing until the whole things dissolves into noise.

In conclusion: Did you know you can group by both healpix and distbin at the same time? I am sure there are interesting structures to be found in what you will get from such a query…

[1] You may be tempted to write source_id/(POWER(2, 35)*POWER(4, 3) here for clarity. Resist that temptation. POWER returns floating point numbers. If you have one float in a division, not even a ROUND will get you back into the integer division realm, and the whole trick implodes. No, you will need the integer literals for now.

Computing Residuals of an Astrometric Calibration

2022-11-29 Markus Demleitner

Two plots, left a fairly good correlation, right a cloudy wave

The kind of plot you can make following the recipe given here: Left, a comparison of the photometry, right, a positional residuals, not taking into account the SIP plate solution, when comparing the HDAP plate B3261a against Gaia DR3. Note that the cut-off a 4 arcsec is because of the match radius when obtaining the calibrator stars.

I recently had to assess the quality of the astrometric calibration of a photographic plate. What I am going to show you in this post will of course work just as well for CCD frames, and if these have a sufficiently large field of view, this may be an issue for them as well. However, the sort of data that needs this assessment most typically are scans of plates, as these tend to have a “wobble”, systematic offsets in the scan direction resulting from imperfections in the mechanics.

Prerequisites: An astronomical frame with a calibration in ICRS (or some frame not very far from it), called my-image.fits in the following, SExtractor (in Debian and derivatives: apt install source-extractor – long live Debian Astro; since it's called source-extractor in Debian, that's what I'll use here, too), and of course TOPCAT.

Step 1: Extract Sources. Source extraction is of course a high science, and if you know better than me, by all means do it the way you think is appropriate. Meanwhile, the following might very well work for you sufficiently well.

Create a working directory and enter it. Then, to create a file telling source-extractor what columns you would like to see, write the following to a file default.param:

ALPHA_SKY
DELTA_SKY
X_IMAGE
Y_IMAGE
MAG_ISO
FLUX_AUTO
ELONGATION

Next, give a few parameters to source-extractor; depending on the sort of image you have, you may want to play around with DETECT_MINAREA (how many pixels need to show a signal to register as a source) and DETECT_THRESH (how many sigmas a pixel has to be above the background to register as a candidate for belonging to a source). Meanwhile, write the following into a file default.control:

CATALOG_TYPE     FITS_1.0
CATALOG_NAME     img.axy
PARAMETERS_NAME  default.param
FILTER           N
DETECT_MINAREA   30
DETECT_THRESH    4
SEEING_FWHM      1.2

– but if the following call gives you a few hundred sources, that ought to work for the present purpose.

Then run:

source-extractor -c default.control my-image.fits

This will give you a catalogue of extracted objects in the file img.axy.

Step 2: Fix source-extractor's output. Load that img.axy into TOPCAT. Regrettably, source-extractor does not add any useful metadata to the columns of its output table. To add the absolute bare minimum, in TOPCAT go to Views → Column Info. In that window, check UCD in the Display menu, and then put pos.eq.ra and pos.eq.dec into the UCD fields of the ALPHA_SKY and DELTA_SKY columns, respectively; double click to change fields in TOPCAT.

To see if you have done the annotation right, in TOPCAT's main window, click Graphics → Sky Plot. If the objects show up, you have just provided enough annotation to let TOPCAT figure out the position for each row.

Step 3: Get calibrators. We will now try to add counterparts for Gaia DR3 to the extracted sources. To do that, click VO → Table Access Protocol, and in the window popping up double click the entry for the GAVO DC TAP.

In the Find box, type dr3lite to look for this site's version of the Gaia DR3 source catalogue. Click on gaia.dr3lite to select that table, and then select the Columns pane. This should show some of the Gaia DR3 columns.

Now Examples → Upload Join will generate a query that will cross-match your extracted sources with the Gaia sources. You should edit it a bit, only selecting the columns you will actually need, removing the TOP 1000 (at least on large images with more than 1000 sources), and reducing the match radius a bit when the calibration is not actually completely off and your epoch is sufficiently close to J2000.

Hint: you can control-click in the Columns pane and then use the Cols button to insert all the column names in one go[1]. For me, the resulting query would be:

SELECT
   source_id, ra, dec, phot_bp_mean_mag,
   tc.*
   FROM gaia.dr3lite AS db
   JOIN TAP_UPLOAD.t1 AS tc
   ON 1=CONTAINS(POINT('ICRS', db.ra, db.dec),
                 CIRCLE('ICRS', tc.ALPHA_SKY, tc.DELTA_SKY, 4./3600.))

This should result in about as many matches as your extraction had – a few more is ok, because you will have some spurious matches, a few less is ok, too, as there are always some outliers and artefacts, but you should clearly not pull a magnitude more or less objects here than you put in; fiddle with the match radius as necessary.

See if there is a rough correlation between the Gaia calibrators and your extracted sources by plotting phot_bp_mean_mag against MAG_ISO. Absent more information, MAG_ISO, source-extractor's guess for the magnitude of the extracted object, will be just some crazy number, but it should have some discernable correlation with the actual magnitude. Do not expect too much here, in particular with old plates, for which good photometry is a science of their own.

In my example, this looked like this:

Plot: a rough correlation in red with a green tail

The green points certainly are spurious matches; this observation did not reach beyond 14th magnitude or so, and there are many weak stars on the sky, so a few of them will show up in just about any cross match. See the opening picture for an example with a better correlation.

Step 4: Do the correlation plot. Do Graphics → Plane Plot and then plot ra-alpha_sky or dec-delta_sky against X_IMAGE or Y_IMAGE. You could get something like this:

This rather certainly reflects some optical distortion; source-extractor regrettably does not take into account SIP corrections yet, so it is likely that a large part of this would be taken care of by the polynomials of the plate solution (the github issue I am linking to tells you how to be sure).

But it can also look like this:

This certainly is not the result of a lens or anything optical at all. It's the scanner's gears that you are looking at here. With an amplitude of perhaps three arcseconds this is rather excessive here; but something like this you will rather likely see even on good scanners – though it may essentially be invisible, as of the Heidelberg scanner we used for HDAP:

Plot: A vertical cloud with no discernible structure.

[1]	I'm using the BP magnitude in the query below as most historical plates tend to be “blue sensitive“ (in some sense). Hence, BP magnitudes should be a bit closer to what source-extractor has extracted.

A Proposed Vector Extension for ADQL

2022-10-07 Markus Demleitner

Getting Spectra for Wolf-Rayet Stars
Investigating the spectra
Computing some statistics
Computing a Template on the Server Side
Finding Similar Objects
Slices?

When I showed off my rendering of the Gaia DR 3 XP spectra a month ago, I promised I would later show how my proposal for a Vector extension to ADQL would enable quite a bit of interesting functionality on that table. Let me make good on this promise with a little project to find candidates for Wolf-Rayet stars, more specifically, WC stars. I give you that's a bit cheap because they have very distinctive spectral features, but then this is supposed to be a quick educational posting, not a science paper.

Before I start, I probably should stress that in this context I am using the word “vector” like a computer guy would. We are talking about one-dimensional arrays here, not about the vectors the mathematicians have given us (as in “elements of vector spaces“).

Getting Spectra for Wolf-Rayet Stars

Let us first produce a list of Wolf-Rayet stars (of any denomination) using SIMBAD. So, start TOPCAT, open the TAP dialog and find the SIMBAD TAP server. Run:

SELECT main_id, ra, dec
FROM basic
WHERE otype='WR*'

there[1]. In the next step, we will need the Gaia DR3 source_ids for these objects, and so it would be nice to pull them immediately from Simbad; you could in principle do that by running:

SELECT main_id, ra, dec, id
FROM basic
  JOIN ident
  ON (oid=oidref)
WHERE otype='WR*'
  AND id LIKE 'Gaia DR3%'

– but for one, fiddling out the actual source_ids from the strings you get in the id column is a bit tedious, and then quite a few of these objects don't have Gaia ids yet: The first query returns 1548 at the moment, the second 696.

If we had the source_ids, we could immediately join with the gdr3spec.spectra table at the GAVO DC TAP (http://dc.g-vo.org/tap). As I mentioned a month ago, this is a physical table just consisting of the source id and arrays of flux and flux errors. There is also gdr3spec.withpos that has positions and the spectra; but that's a view, and for the time being that means that the planner will quite likely get confused when positional constraints come into play[2]. The result would be queries running for half an hour when a few seconds would do just as well.

On services already supporting ADQL 2.1, you can usually work around problems of this kind by re-writing your query to use CTEs (“WITH”), because these often work as planner barriers. In the present case, we first get source_ids for our Simbad objects in a CTE and then use these to join with our spectra table, like this:

WITH wrids AS (
        SELECT source_id, main_id
        FROM gaia.edr3lite AS l
    JOIN tap_upload.t1 AS u
        ON (DISTANCE(l.ra, l.dec, u.ra, u.dec)<0.001))
SELECT main_id, source_id, flux, flux_error
FROM gdr3spec.spectra
JOIN wrids USING (source_id)

As usual, you will probably have to adapt the number in what is tap_upload.t1 here to the table index you have for your SIMBAD result.

This yields 574 spectra at the moment, within a few seconds. Spectra for about a third of our collection of objects: I'd say that's quite impressive.

Investigating the spectra

The September post already discussed a few aspects of array plotting. In short: try the plane plot and an XYArray control. Modern TOPCATs (and for what I'm doing here it's wise to use something newer than 4.8-7) will automatically figure out suitable columns for the x and y axes (except it forgets to label the spectral coordinate with its unit, nm):

There's a quite a bit of crowding here; finding global characteristics perhaps works better when you switch the ordinate to logarithmic, use transparent shading (in the Form tab) and raise the opaque limit a bit. This could give you something like:

You can see that at least quite a few objects have nice and strong emission lines, as I had hoped for when choosing WC stars for this example. What if we could pick them out to build a template spectrum? Well, let's try. With the new vector math in ADQL, the database can normalise the spectra and compute a few statistics on them.

But first: In order to get rid of the source_id-computing CTE above, let me obtain the source_ids I want to work with once and for all, as in:

SELECT source_id, main_id
FROM gaia.edr3lite AS l
  JOIN tap_upload.t1 AS u
        ON (DISTANCE(l.ra, l.dec, u.ra, u.dec)<0.001)

Memorise the Table List index of the result. With that, you can directly work with the gdr3spec.spectra table and experiment a bit; for me, the index was 10, and hence I use tap_uploads.t10 below.

Computing some statistics

The Gaia XP spectra are flux calibrated, and hence I will have to normalise them if I want to compare them. Ignoring all errors and thus in particular the fact that some (few) components are negative, this normalisation is harmless: We just divide by the sum of all vector components. The net result is that, were our spectra continuous, the integral over them would be one. And let's then use the standard deviation and the value of the 19th array element as metrics:

WITH normalised AS (
SELECT source_id, main_id,
        flux/arr_sum(flux) as nflux, flux
FROM gdr3spec.spectra
JOIN tap_upload.t10
USING (source_id))
SELECT
  source_id, main_id,
  flux, nflux,
  arr_avg(nflux*nflux)-POWER(arr_avg(nflux),2) AS sd,
  nflux[19] as em
FROM normalised

You can see quite a bit of the vector extension here:

arr_sum and arr_avg: These work as the aggregate functions in SQL do, just not on tuples but on the components of the vectors.
Multiplication of vectors is element-wise, so nflux*nflux computes a vector of the squares of the components of nflux. That's also true for all other basic arithmetic operators. If you know numpy: same thing.
You fetch individual elements in the [] notation you probably know from Python or C. Contrary to Python and C, common SQL implementations count indexes from 1 by default, and we are keeping that here (for now). Fortran lovers rejoice!

Why did I use nflux[19]? Well, there is a reasonably strong emission feature in many spectra at about 580 nm (it's three-time ionised carbon, or C IV in the rotten notation I usually apologise for when talking to non-astronomers), which is, as experts tell me (thanks, Andreas!), rather characteristic for the WC stars that I'm after (whereas the even stronger feature on the left, around 470 nm, can also be Helium).

If you inspect the spectral coordinate that TOPCAT has on the abscissa (it's a param, so you'll have to go to Views → Table Parameters to see it), you will see that each spectral bin is 10 nm wide. So, I will hopefully hit the 580 nm feature when fetching the 19th element of the spectral vector.

If you plot em versus the sd obtained like that, you will see two reasonably distinct groups, where the ascending arm has relatively strong emission around 580 nm and the descending arm does not. I have used the Blob subset feature to select the upper arm into a subset ”upper” that is blue in the following plot. If you click around in the em-sd plot and show the Activated subset in an array plot, you can see things like:

Screenshot: scatterplot and stacked spectra next to each other

The activated spectrum (shown here in yellow) has a strong Hα but basically no C IV – and it's safely outside of our carbon subset. Click around a bit on the ascending arm, and you will see that all these spectra have a bump around array element 18 (in TOPCAT's count, which starts at 0).

Computing a Template on the Server Side

Whatever the subset of stars that we would like to use to define our group of interest, we would now like to create a template spectrum from them. A plausible way to do that is to sum them all up – that has the nice side effect that stronger sources (which hopefully are less noisy) have a larger weight.

To compute the template, in the Views → Column Info for the sd/em table, unselect all columns but source_id (that way, you only upload what you absolutely must), and in the main window, select the upper subset in the Row Subset combo box. That way, only the rows in that subset will get uploaded in the following query:

SELECT summed/arr_sum(summed) AS tpl
FROM (
  SELECT SUM(flux) AS summed
  FROM gdr3spec.spectra
  JOIN tap_upload.t19
  USING (source_id)) AS q

Again, you will have to adapt the t19 to where your manipulated sd/em table is. If you get an “ambiguous column flux” error (or so) here, you forgot to unselect all columns but source_id in the columns window.

It pays to briefly appreciate what happens here: The SELECT SUM(flux) is an aggregate function over arrays, meaning that all the arrays are being summed over component-wise. Against that, the sum in summed/arr_sum(summed) is summing within the array. If it helps you, you could imagine having all the arrays in the table stacked. Then, SUM(arr) produces the vertical margin sum, and array_sum(arr) procudes the horizontal margin sum.

Well, here's the template I got in this way:

I give you that in this particular case, you could easily have done the computation on the client side, because you already had the spectra in your table. But the technique also works when you don't, and it will scale to millions of arrays (although you will have to carefully think about numerics when doing such enormous sums).

Also – I cannot lie – I simply had to have a pretext for showing you aggregate functions over arrays.

Finding Similar Objects

Now that we have the template, can we find objects that have similar properties? Sure: We upload the array and compute some metric, perhaps the (squared) euclidian distances to normalised spectra. If the template is in TOPCAT's table 25, you can write:

SELECT TOP 200000
  source_id,
  arr_sum(
    arr_map(
      power(x,2),tpl-flux/arr_sum(flux))) AS dist2,
  arr_sum(flux_error/flux) AS errs
FROM gdr3spec.spectra, tap_upload.t25

This will compute the distances between (conceptually randomly drawn) 200000 spectra and your template. I am also requesting the sum of the (relative) flux errors as a measure of how likely it is that wild wiggles actually are just artefacts.

There is one array-related feature in that query I have not yet mentioned: arr_map. This applies an expression to all components to a vector, pretty much like python's map function, and my attempt to have some (perhaps somewhat lame) substitute for numpy's ufunctions.

I am rather sure we really need something like this. SQL has no notion of defining functions in queries. That is usually welcome, as otherwise it would quicky become Turing-complete, which would be bad for what it is designed to do. Here, however, that is a problem, because we do not have a clean way to write the expression be be computed for each component. For now, I have decreed that the first argument of map is an expression over a formal x. This is ugly not only because it will be confusing when there's an actual x in a table or query. I suspect with a bit more thought and creativity, one can find a better solution that still does not require a re-write of half the SQL grammar. But then let's see – perhaps this makeshift hack proves to be less troublesome than I expect.

Note that on my server, you will only get back 20000 matches by default; you would have to adapt Max Rows to actually retrieve 200'000, and then you also must switch to Asynchronous mode. This will then actually take a non-trivial amount of CPU and disk I/O; going through the entire set of 2e8 rows will be a matter of hours or so. Hence, I'm grateful if you do all-sky scans only after having tested your queries on much smaller subsets.

I've done this for 500'000 rows (which took a few minutes), which might bring up a few C IV-strong WR stars (these beasts are rare, you know). The result in the dist2/errs plane is (logplot zoomed a bit):

Plot: a gigantic point cloud with a few outliers.

Well: at least there are a few promising cases. Which would conclude this little demo for the ADQL vector proposal. Looking at what we have found here is another story.

Still, I could not resist having another look at what my box has found there. There is a rather clear cut in the plot at perhaps 0.009, and thus I created a subset interesting consisting of objects for which dist2<0.009 (which is 18 objects for me) and did the trick above, only uploading this interesting subset with only the source_id column to resolve to Gaia DR3 (lite) rows:

SELECT *
FROM gaia.dr3lite
JOIN tap_upload.t33
USING (source_id)

And then I wondered whether any of these were known to SIMBAD and switched to their TAP service:

SELECT
  tc.*, otype
FROM basic AS db
 RIGHT OUTER JOIN TAP_UPLOAD.t34 AS tc
 ON 1=CONTAINS(POINT('ICRS', db.ra, db.dec),
               CIRCLE('ICRS', tc.ra, tc.dec, 1./3600.))

Note the use of RIGHT OUTER JOIN to ensure we won't lose any matches on the way; if this weren't such a small table, you'd be better off just uploading the positions and then doing a local match to recover the rest of the table, by the way.

As to what's coming back: Well, a bunch of white dwarf candidates, a “blue“ star, a few objects SIMBAD knows as quasars (that at least makes sense, because it's rather likely that some of them have lines redshifted into my C IV window), and a few unclassified objects. Whether SIMBAD is wrong on at least some of them, whether the positional crossmatch fetched unrelated objects, or whether I got it all wrong I will not decide here. Let me give you my candidates as a VOTable, though.

You now know what you have to do to add nice, if low-resolution, spectra to them.

Slices?

A notable absence from the current vector extension is slicing. I think we should have it – in this example, this would be really useful when summing different spectral regions without having to write long sums (“synthetic broadband photometry“).

I have not put it in yet mainly because I am not sure if Python-like syntax (nflux[4:7]) is a good idea when we have 1-based arrays. Also: Do we want to keep the upper index out? That's certainly the right thing for Python (where you want a[:3]+a[3:] == a), but is it here? Speaking of which: Should we require support of half-open slices? Should we rather have a function arr_slice? With what arguments?

I'd be curious about other peoples' thoughts on slicing.

[1]	In case you wonder how I came up with the `WR*`: You can simply run something like: SELECT otype, label FROM otypedef WHERE description LIKE '%Wolf%' Once Simbad upgrades to ADQL 2.1, you probably want to replace the LIKE with ILIKE for robustness.

[2]	For the incurably curious, you can learn more about the underlying problem at https://github.com/segasai/q3c/issues/30.

Page 1 / 4 »