GAVO Blog: Virtual Observatory Matters: Categories

Articles from Standards

A Grey Eminence of a Standard

2018-10-30 Markus Demleitner

Examples for extra metadata: extended column descriptions on the web pages accompanying the ARI-Gaia TAP service.

Last friday, I've uploaded a first working draft of VODataService 1.2 to the IVOA documents repository. That's the first major step in updating a standard, and it's an invitation to everyone to have a look and comment.

Foof, you might say, what do I care? I've not even heard of that standard.

Well, but you've probably used it. VODataService is (among several other things) the standard that governs how a TAP service tells clients (TOPCAT, say) what tables it has and what's inside of them. So, if you see in TOPCAT that there is a column named ang_error with a unit of deg, a UCD of stat.error;pos and the meaning “1 σ confidence radius of the position”, that most likely came in a document standardised by VODataService.

The question of what (TAP) services can tell clients about their table set is one major open point: Do we want additional metadata there? This article's image, for inspiration, shows a screenshot of extended metadata Grégory delivers to browsers on his ARI-Gaia service; among this are minima, maxima, means, standard deviations, quartiles, and fill factors (i.e., how many of the columns are NULL). He even shows histograms of the values' distributions and HEALPix maps showing how (the means of) the values vary on the sky. Another example of extended metadata could be footnotes as you will find them on many of my resources' reference URLs (example; footnotes are, unsurprisingly, near the foot of that page).

We could define interoperable means to communicate information like this. The question is: does the added value justify the complication in implementation? This is where it would be great if you weighed in, in particular if you are a “mere” TAP user: Are there any such pieces of metadata you've always wanted to see in your TAP interfaces? Oh, and metadata of course can also be added to tables rather than columns. The current draft already lets services communicate the number of rows in each table – is there more “simple”, table-specific metadata of this sort?

VODataService furthermore deals with several other topics; for instance, the STC in the registry business I've blogged about in February is going to be standardised here (update on this: spectral coverage is no longer in wavelength but in energy). Other changes are rather more technical in nature, like several new resource types that will improve the discovery of tables and other such resources, or a careful adjustment of some features to keep them in line with TAP evolution.

But don't let the technicalities scare you away – just have a peek, and if you have thoughts on any of the VODataService topics: I'm just a mail away.

Category: Standards
Speak out on ADQL 2.1

2018-03-05 Markus Demleitner

If you've always wanted to be part of a standardisation process within the IVOA (and who would not?), the time has rarely been as good as now. Because: We're updating ADQL! Yes! The ADQL you are writing your queries in will receive a few more language elements, and we're carefully trying to heal a few things that turned out to be warts. And while some of the changes are as dull and boring as you may expect standards work to be, on some of them you may wish to have a saying.

Also, you can try things out – the GAVO data center TAP endpoint at http://dc.g-vo.org/tap already has most of the proposed features, and the new DaCHS beta 1.1.2 (out since last Friday) does, too. So, if you're running DaCHS yourself, you can start playing after switching to the beta repository.

What's new?
- You're now supposed to write the standard crossmatch as DISTANCE(ra1, dec1, ra2, dec2)<dist. This replaces the old dance with 1=CONTAINS(POINT(), CIRCLE()) that you've probably learned to hate. Finally: Crossmatching without having to resort to TOPCAT's example menu...
- ADQL geometries used to require a first argument that would give the reference frame, as in POINT('ICRS', ra, dec). The hope was that services could then automagically make a statement like CONTAINS(point_in_icrs, circle_in_galactic) work as presumably intended. Few services ever did (DaCHS still tries reasonably hard), and when they did, there were all kinds of opaque oddities. One of the most common sources of confusion is the question what a service is supposed to do with POINT('GALACTIC', ra, dec), assuming it knows that ra and dec are in, say, B1950 FK4. Also, is there any expectation that services attempt to do anything beyond a simple rotation (FK4, for instance, rotates noticably against the ICRS, so proper motions would need to get fixed, too)? In all, the frame as a first argument was ill thought-out, and it's been deprecated. Simply don't put in the string-typed first argument any more. POINT(long, lat) does it. True: This, more than ever, calls for an ADQL astrometry library so you can easily convert, at least, between Galactic and ICRS (probably a few more would be useful, too). More on this in some future post.
- Services should have CAST now. Sometimes you want to turn a number into a string or a string into a timestamp. In such cases, you can write CAST('1991-02-01', TIMESTAMP) now. The details are not quite, excuse me, cast in stone yet, so if you have a use case for this kind of thing, speak up now. The current draft also calls for a TIMESTAMP(tx) function – but since that's really not different from CAST(tx, TIMESTAMP), I'm trying to dissuade people from adding it.
- Services should have an IN_UNIT function now. That's a nifty thing in particular when you're re-using queries on different services. Just write, say, IN_UNIT(pmra, 'deg/yr') and never worry again if it's arcsec/yr, mas/yr, rad/cy, or whatever. The second argument, by the way, is written according to the Units in the Virtual Observatory standard. It's an optional feature according to the current standard, so perhaps it's too early to party, but I've found this extremely useful, and so I hope we'll see widespread adoption.
- Services should now have set operations. These are UNION, EXCEPT, and INTERSECT and are useful when you have two queries that result in the same table schema (because they won't work otherwise). Say you have two complex ways to filter rows from the table source, but you want to process both sorts of results further on – you can say then say something like:
```
     SELECT <whatever complex> FROM
         (SELECT a,b,c FROM source
           WHERE <crazy stuff>
           GROUP BY a, b, c) as left
       UNION
         (SELECT a,b,c FROM source
           WHERE <other crazy stuff>
           GROUP BY a, b, c) as right
     WHERE <more complex stuff over a, b, and c>

– and similarly, EXCEPT lets you “punch a hole” in a result table.
Another interesting use case would be to query many tables on a
service like VizieR in one go; that still works if you make sure the
tables defined by the sub-queries have the same columns. Given that a
lot of cross-table operations actually boil down to JOINs and WHERE
clauses, the set operations are used less that one would expect. But
if you need them, there's no real alternative (short of downloading
far too much and performing the operation locally, which of course
defeats the purpose of TAP).
```
- Common table expressions (“WITH”). DaCHS doesn't do these yet, and it will only pick them up if someone else implements them first. In the way ADQL 2.1 has them (“nonrecursive”), CTEs are little more than syntactic sugar, and I'm not quite sure if the additional implementation complexity is worth it. If you're curious, check CTEs in the postgres manual. If that makes you drool for WITH in ADQL, let me know. It'll not be too hard to sway me to put them in.
- Bitwise Operations. That's when integers are treated as bit patterns. If this sounds like nerd stuff to you, well, it happens quite a bit in actual catalogs. See, for instance, Note 3 for the PPMXL. You'd need the flags column described there if you wanted to exclude PPMXL objects that replaced multiple USNO-B1.0 objects (bit 3), you will right now have to write something like MOD(flags,16)>7. That's a bit of magic that everyone will have to think about for a while. With bitwise operations, you'll just write BITWISE_AND(flags,8)=8, which will look familiar to everyone who has used the pattern before (in particular, it's clear we're talking about bit 3). There still is discussion whether bitwise operations are common enough to warrant special syntax – the draft currently says the above should be written as flags&8=8 – or whether the functions DaCHS has at the moment (they're called BITWISE_AND, BITWISE_OR, BITWISE_XOR, and BITWISE_NOT) are good enough.
- Offset. If you've ever done anything with ADQL, you'll know that SELECT TOP 10 * FROM hipparcos.main ORDER BY parallax DESC will give you the 10 objects with the larges parallaxes. But what if you want the next but 10 closest stars? Well, OFFSET to the rescue:
```
SELECT TOP 10 *
FROM hipparcos.main
ORDER BY parallax DESC
OFFSET 10
```
  There is another, more sinister, application for OFFSET, which happens to be the actual reason I've put it into DaCHS' ADQL ages ago: Written as OFFSET 0 several databases use it to denote a barries for the query planner. This is explained to some degree in the class DaCHS TAP example Crossmatch for a Guide Star – which still mentions the first hack I had built into DaCHS to let query authors rein in overzealous query planners.
- LOWER and ILIKE. ADQL has been extremely weak on the side of text processing, so weak indeed that it wasn't nearly enough to cover the use cases for the registry when it moved to RegTAP. ADQL 2.1 adds two basic features – LOWER, a function that lets people query in a case-insensitive fashion, and ILIKE, an operator that is like LIKE, but again ignores case. While both features are obviously great as soon as people dump any kind of text (think object names) into their databases, I'm not terribly happy with ILIKE, as it does the same as RegTAP's ivoa_nocasematch user defined function, and it's always bad when a two standards forsee two different mechanisms for the same thing.
- Geometry-typed arguments. CIRCLE and POLYGON now accept POINTs in alternative constructor functions. That is, you can now say CIRCLE(POINT(ra, dec), radius) in addition to the traditional CIRCLE(ra, dec, radius). In itself, that's probably not terribly exciting, but when you have actual POINTs in your database, it's much more compact to write, say:
```
SELECT *
FROM zcosmos.data
WHERE 0=CONTAINS(
  ssa_targetpos,
  CIRCLE(ssa_location, ssa_aperture))
```
  (which would return rows for those spectra for which the declared aperture does not contain the declared target). Before, you'd had to write some fairly ugly expression involving COORD1 and whatnot in order to achieve the same effect.
- Boolean expressions. That's another one that's still a bit up in the air. First, the rough goal is to allow boolean values in ADQL-accessible tables, which so far have been a hack at best. In the future, you should be able to say WHERE is_broken=True. However, people coming from other languages will find that odd, and indeed, in python I'd cringe on if is_broken==True:. What I'd expect is if is_broken:. Do we want this in ADQL? Currently, it's in the grammar (more or less like this), but this kind of thing makes it still harder to produce useful syntax error messages. Is it worth it, either way? I'm not sure.
That about concludes my quick review of the new features of ADQL 2.1. If you'd like to know more, the current draft is on the IVOA document repository, and if you can deal with version control (you should!), you can follow the bleeding edge in the ADQL document in Volute. Discussion happens on the DAL mailing list.

Update (2018-04-13): Well, as to the CTEs, I couldn't resist after all, and they're in with DaCHS 1.1.3. And I have to say a love them -- they weren't hard to put in, and once they're there they make so many queries a good deal more readable than before. I've even put it a server-defined example for CTEs on the Heidelberg TAP service showcasing a particularly compelling use case.

Category: Standards
Space and Time not lost on the Registry

2018-02-14 Markus Demleitner

A histogram of times for which the Palomar-Leiden service has images: That's temporal service coverage right there.

If you are an astronomer and you've ever tried looking for data in the Virtual Observatory Registry, chances are you have wondered “Why can't I enter my position here?” Or perhaps “So, I'm looking for images in [NIII] – where would I go?”

Both of these are examples for the use of Space-Time Coordinates (STC) in data discovery – yes, spectral coordinates count as STC, too, and I could make an argument for it. But this post is about something else: None of this has worked in the Registry up to now.

It's time to mend this blatant omission. To take the next steps, after a bit of discussion on some of the IVOA's mailing lists, I have posted an IVOA note proposing exactly those last Thursday. It is, perhaps with a bit of over-confidence, called A Roadmap for Space-Time Discovery in the VO Registry. And I'd much appreciate feedback, in particular if you are a VO user and have ideas on what you'd like to do with such a facility.

In this post, I'd like to give a very quick run-down on what is in it for (1) VO users, (2) service operators in general, and (3) service operators who happen to run DaCHS.

First, users. We already are pretty good on spatial coverage (for about 13000 of almost 20000 resources), so it might be worth experimenting with that. For now, the corresponding table is only available on the RegTAP mirror at http://dc.g-vo.org/tap. There, you can try queries like:
```
select ivoid from
rr.table_column
natural join rr.stc_spatial
where
  1=contains(gavo_simbadpoint('HDF'), coverage)
  and ucd like 'phot.flux;em.radio%'
```
to find – in this case – services that have radio fluxes in the area of the Hubble Deep Field. If these lines scare you or you don't know what to do with the stupid ivoids, check the previous post on this blog – it explains a bit more about RegTAP and why you might care.

Similarly cool things will, hopefully, some day be possible in spectrum and time. For instance, if you were interested in SII fluxes in the crab nebula in the early sixties, you could, some day, write:
```
SELECT ivoid FROM
rr.stc_temporal
NATURAL JOIN rr.stc_spectral
NATURAL JOIN rr.stc_spatial
WHERE
  1=CONTAINS(gavo_simbadpoint('M1'), coverage)
  AND 1=ivo_interval_overlaps(
    6.69e-7, 6.75e-7,
    wavelength_start, wavelength_end)
  AND 1=ivo_interval_overlaps(
    36900, 38800,
    time_start, time_end)
```
As you can see, the spectral coordiate will, following (admittedly broken) VO convention, be given in meters of vacuum wavelength, and time in MJD. In particular the thing with the wavelength isn't quite settled yet – personally, I'd much rather have energy there. For one, it's independent of the embedding medium, but much more excitingly, it even remains somewhat sensible when you go to non-electromagnetic messengers.

A pattern I'm trying to establish is the use of the user-defined function ivo_interval_overlaps, also defined in the Note. This is intended to allow robust query patterns in the presence of two intrinsically interval-valued things: The service's coverage and the part of the spectrum you're interested in, say. With the proposed pattern, either of these can degenerate to a single point and things still work. Things only break when both the service and you figure that “Aw, Hα is just 656.3 nm” and one of you omits a digit or adds one.

But that's academic at this point, because really few resources define their coverage in time and and spectrum. Try it yourself:
```
SELECT COUNT(*) FROM (
  SELECT DISTINCT ivoid FROM rr.stc_temporal) AS q
```
(the subquery with the DISTINCT is necessary because a single resource can have multiple rows for time and spectrum when there's multiple distinct intervals – think observation campaigns). If this gives you more than a few dozen rows when you read this, I strongly suspect it's no longer 2018.

To improve this situation, the service operators need to provide the information on the coverage in their resource records. Indeed, the registry schemas already have the notion of a coverage, and the Note, in its core, simply proposes to add three elements to the coverage element of VODataService 1.1. Two of these new elements – the coverage in time and space – are simple floating-point intervals and can be repeated in order to allow non-contiguous coverage. The third element, the spatial coverage, uses a nifty data structure called a MOC, which expands to “HEALPix Multi-Order Coverage map” and is the main reason why I claim we can now pull off STC in the Registry: MOCs let databases and other programs easily and quickly manipulate areas on the sphere. Without MOCs, that's a pain.

So, if you have registry records somewhere, please add the elements as soon as you can – if you don't know how to make a MOC: CDS' Aladin is there to help. In the end, your coverage elements should look somewhat like this:
```
<coverage>
  <spatial>3/336,338,450-451,651-652,659,662-663
    4/1816,1818-1819,1822-1823,1829,1840-1841</spatial>
  <temporal>37190 37250</temporal>
  <temporal>54776 54802</temporal>
  <spectral>3.3e-07 6.6e-07</spectral>
  <spectral>2.0e-05 3.5e-06</spectral>
  <waveband>Optical</waveband>
  <waveband>Infrared</waveband>
</coverage>
```
The waveband elements are remainders from VODataService 1.1. They are still in use (prominently, for one, in SPLAT), and it's certainly still a good idea to keep giving them for the forseeable future. You can also see how you would represent multiple observing campaigns and different spectral ranges.

Finally, if you're running DaCHS and you're using it to generate registry records (and there's almost no excuse for not doing so), you can simply write a coverage element into your RD starting with DaCHS 1.2 (or, if you run betas, 1.1.1, which is already available). You'll find lots of examples at the usual place. As a relatively interesting example, the resource descriptor of plts. It has this:
```
  <updater spaceTable="data" spectralTable="data" mocOrder="4"/>
  <spectral>3.3e-07 6.6e-07</spectral>
  <temporal>37190 37250</temporal>
  <temporal>38776 38802</temporal>
  <temporal>41022 41107</temporal>
  <temporal>41387 41409</temporal>
  <temporal>41936 41979</temporal>
  <temporal>43416 43454</temporal>
  <spatial>3/282,410 4/40,323,326,329,332,387,390,396,648-650,1083,1085,1087,1101-1103,1123,1125,1132-1134,1136,1138-1139,1144,1146-1147,1173-1175,1216-1217,1220,1223,1229,1231,1235-1236,1238,1240,1597,1599,1614,1634,1636,1728,1730,1737,1739-1740,1765-1766,1784,1786,2803,2807,2809,2812</spatial>
</coverage>
```
This particular service archives plate scans from the Palomar-Leiden Trojan surveys; these were looking for Trojan asteroids (of Jupiter) using the Palomar 122 cm Schmidt and were conducted in several shortish campaigns between 1960 and 1977 (incidentally, if you're looking for things near the Ecliptic, this stuff might still hold valuable insights for you). Because the fill factor for the whole time period is rather small, I manually extracted the time coverage; for that, I ran select dateobs from plts.data via TAP and made the histogram plot above. Zooming in a bit, I read off the limits in TOPCAT's coordinate display.

The other coverages, however, were put in automatically by DaCHS. That's what the updater element does: for each axis, you can say where DaCHS should look, and it will then fill in the appropriate data from what it guesses gives the relevant coordiantes – that's straightforward for standard tables like the ones behind SSAP and SIAP services (or obscore tables, for that matter), perhaps a bit more involved otherwise. To say “just do it for all axis”, give the updater a single sourceTable attribute.

Finally, in this case I'm overriding mocOrder, the order down to which DaCHS tries to resolve spatial features. I'm doing this here because in determining the coverage of image services DaCHS right now only considers the centers of the images, and that's severely underestimating the coverage here, where the data products are the beautiful large Schmidt plates. Hence, I'm lowering the resolution from the default 6 (about one degree linearly) to still give some approximation to the actual data coverage. We'll fix the underlying deficit as soon as pgsphere, the postgres extension which is actually dealing with all the MOCs, has support for turning circles and polygons into MOCs.

When you have defined an updater, just run dachs limits q.rd, and DaCHS will carefully (preserving your indentation) re-write the RD to contain what DaCHS has worked out from your table (but careful: it will overwrite what was previously there; so, make sure you only ask DaCHS to only deal with axes you're not dealing with manually).

If you feel like writing code discovering holes in the intervals, ideally already in the database: that would be great, because the tighter the intervals defined, the fewer false positives people will have in data discovery.

The take-away for DaCHS operators is:
1. Add STC coverage to your resources as soon as you've updated to DaCHS 1.2
2. If you don't have to have the tightest coverage declaration conceivable, all you have to do to have that is add:
```
<coverage>
  <updater sourceTable="my_table"/>
</coverage>
```
  to your RD (where my_table is the id of your service's “main” table) and then run dachs limits q.rd
3. For special effects and further information, see Coverage Metadata in the DaCHS reference documentation
4. If you have a nice postgres function that splits a simple coverage interval up so the filling factor of a set of new intervals increases (or know a nice, database-compatible algorithm to do so) – please let me know.
Category: Standards
Time Series

2017-05-03 Markus Demleitner

The IOVA's committee on science priorities (CSP) has declared the “time domain” as one of its focus topics quite a while ago, an action boiling down to a call to the IVOA member projects to think about support for time series and their analysis in services, standards, and clients.

While for several years, response has been lackluster, work on time series has gathered quite a bit of steam recently. For instance, the spectral client SPLAT (co-maintained by GAVO) has grown some preliminary support to properly display time series (very rudimentary in what's currently released), and lively discussions on proper metadata for time series have been going on on the Data Models mailing list of the IVOA – if you're interested in the time domain, this would be a good time to subscribe for a while and comment as appropriate.

Meanwhile, in our Heidelberg data center, we've joined the fray by publishing our first time series service (science background: searching for exoplanets in the Milky Way bulge using gravitational lensing), which is available through SSA (look for k2c9vst) and through ObsCore (at http://dc.g-vo.org/tap, collection name k2c9vst), too. For details see also the service info.

Since right now future standards are being worked out, this is a perfect time to publish your time series; this way you get to influence what people will be able to tell machines about their time series in the next couple of years. Ask our staff (contact below) if you want us to publish for you. But you can also self-publish using the DaCHS publication package. Refer to the resource descriptor of the k2c9vst service to get started.

At its heart is the table definition of the time series, which is basically:
```
<table id="instance">
  <column name="hjd" type="double precision"
      unit="d" ucd="time.epoch"
      tablehead="Time"
      description="Time this photometry corresponds to."
      verbLevel="1"/>
  <column name="df" type="double precision"
      unit="adu" ucd="phot.flux"
      tablehead="Diff. Flux"
      description="Difference as defined by 2008MNRAS.386L..77B"
      verbLevel="1"/>
  <column name="e_df"
      unit="adu" ucd="stat.error;phot.flux"
      tablehead="Err. DF"
      description="Error in difference flux."
      verbLevel="15"/>
</table>
```
– in the actual service, there are a few more columns, but time, value, and error actually make up a full time series.

Except that a machine can't really tell what this is yet (well, perhaps it could using UCDs, but that's a different matter). What it needs to work out is what's the independent axis, what the frames are, etc. And to do that, the machine needs annotation, i.e., machine-readable, structured declarations alongside the data and the “classic” metadata like units and descriptions.

In actual VOTables, this will be happening through VO-DML annotation, which is also still seriously being discussed; whatever we currently spit out you can inspect in the XML source of this example document.

DaCHS, however, isolates you from the concrete details of writing VOTables. Instead, you write annotations in a JSON-inspired little language we've christened SIL (“Simple Instance Language”; reference). The complicated part is to know what types and attributes you have to declare, which is exactly what the data models is a bout. As said initially, the details are still in flux here, but this is what things look like right now:
```
<dm>
  (ivoa:Measurement) {
    value: @df
    statError: @e_df
  }
</dm>

<dm>
  (stc2:Coords) {
    time: (stc2:Coord) {
      frame:
        (stc2:TimeFrame) {
          timescale: UTC
          refPosition: BARYCENTER
          kind: JD }
      loc: @hjd
    }
    space:
      (stc2:Coord) {
        frame:
          (stc2:SpaceFrame) {
            orientation: ICRS
            epoch: "J2000.0"
          }
        loc: [@raj2000 @dej2000]
    }
  }
</dm>

<dm>
  (ndcube:Cube) {
    independent_axes: [@hjd]
    dependent_axes: [@df @mag]
  }
</dm>
```
If you consider this for a moment, you'll see that each dm element corresponds to something like an object template of a certain “type”. The first, for instance, defines a measurement with a value and a statistical error. Both happen to be given as references to columns in the table defined above (as indicated by the @ signs).

The last annotation defines a data cube; a time series in this definition is simply a data cube with just a single non-degenerate independently varying axis (the independent_axis attribute; in the value the square brackets indicate a sequence) that happens to be time-like. And that hjd is time-like, VO-DML enabled clients will work out when interpreting the STC (“Space-Time-Coordinates”) annotation. In there, you will see that hjd is referenced from the time attribute and with a time-like frame that also defines that this particular flavor of HJD is what a hypothetical clock at the solar system's barycenter would measure if it stood in the gravitational potential in Greenwhich, and had leap seconds thrown in now and then. And that long story is communicated through “literals”, constant strings like “BARYCENTER” or ”TT”, which are also legal within DaCHS data model annotations.

This may seem a bit complicated at first. I argue, though, that given what time series clients will have to do anyway, going through the cube and STC annotations is actually about the most straightforward thing you can do.

But perhaps I'm wrong, so again: None of this is cast in stone right now. Comments are even more welcome than usual, either below or at gavo@ari.uni-heidelberg.de.

Category: Standards
ProvenanceDM Working Draft released

2016-11-22 Kristin Riebe

We've released the first version of working draft for the IVOA Provenance Data Model at the IVOA documents page:

ProvenanceDM Working draft.

Updated versions will be put at the same URL (check the date! The first version is from 21st November 2016).

Want to get your hands on the very latest version? Check out the volute svn repository! Since it's not so easy to find what you want there, here's the path to the Provenance Data Model at volute, and here's a direct link to the latest development draft [pdf].

We're happy to receive some feedback on the document via IVOA's data modelling mailing list dm@ivoa.net.

Category: Standards

« Page 2 / 3 »